Biological data networks and methods therefor

ABSTRACT

A network node including a network interface and a packet generator in communication with the network interface is disclosed. The packet generator is configured to generate a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence. The network node further includes a queue in communication with the network interface, the data packet being stored within the queue. The network node also includes a transmit controller for controlling transmission of the data packet over a network accessible through the network interface.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of priority under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 61/451,086, entitled BIOLOGICAL DATA NETWORK, filed on Mar. 9, 2011, of U.S. Provisional Patent Application Ser. No. 61/539,942, entitled SYSTEM AND METHOD FOR SECURE, HIGHSPEED TRANSFER OF VERY LARGE FILES, filed Sep. 27, 2011, and of U.S. Provisional Patent Application Ser. No. 61/539,931, entitled SYSTEM AND METHOD FOR FACILITATING NETWORK-BASED TRANSACTIONS INVOLVING SEQUENCE DATA, filed Sep. 27, 2011, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is related to U.S. Utility patent application Ser. No. 12/837,452, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA, filed on Jul. 15, 2010, which claims priority to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, and to U.S. Utility patent application Ser. No. 12/828,234, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA, filed on Jun. 30, 2010, which claims priority to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is also related to U.S. Utility patent application Ser. No. 13/223,077, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,084, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,088, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,092, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,097, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is also related to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, the disclosures of which are herein incorporated by reference in their entirety.

FIELD

This application is generally directed to processing and networking polymeric sequence information, including biopolymeric sequence information such as DNA sequence information.

BACKGROUND

Deoxyribonucleic acid (“DNA”) sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of DNA sequences is invaluable in basic biological research as well as in numerous applied fields such as, but not limited to, medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and life sciences.

Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two-dimensional chromatography. Due to the initial difficulties in sequencing in the early 1970s, the cost and speed could be measured in scientist years per nucleotide base as researchers set out to sequence the first restriction endonuclease site containing just a handful of bases. Thirty years later, the entire 3.2 billion bases of the human genome have been sequenced, with a first complete draft of the human genome done at a cost of about three billion dollars. Since then sequencing costs have rapidly decreased.

Today, the cost of sequencing the human genome is on the order of $5000 and is expected to hit the $1000 mark later this year with the results available in hours, much like a routine blood test. As the cost of sequencing the human genome continues to plummet, the number of individuals having their DNA sequenced for medical, as well as other purposes, will likely increase significantly. Currently, the nucleotide base sequence data collected from DNA sequencing operations are stored in multiple different formats in a number of different databases.

Such databases also contain annotations and other attribute information related to the DNA sequence data including, for example, information concerning single nucleotide polymorphisms (SNPs), gene expression, copy number variations and methylation sequence. Moreover, transcriptomic and proteomic data are also present in multiple formats in multiple databases. This renders it impractical to exchange and process the sources of genome sequence data and related information collected in various locations, thereby hampering the potential for scientific discoveries and advancements.

SUMMARY

In one aspect the disclosure relates to a method of conveying biological sequence data. The method includes generating a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence. The method also includes storing the data packet in a queue in communication with a network interface. The method further includes transmitting the data packet over a network accessible through the network interface.
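By way of illustration only, the conveying method described above might be sketched as follows. The length-prefixed wire layout, the JSON encoding of the two headers, the substitution-only difference encoding and all names (`build_packet`, `diff_against_reference`, `transmit`) are assumptions made for this sketch and are not specified by the disclosure.

```python
import json
import struct
from collections import deque


def diff_against_reference(sequence, reference):
    """Represent a sequence as (position, base) substitutions relative to a reference.

    Assumption: both strings have equal length; insertions and deletions
    are outside the scope of this sketch.
    """
    if len(sequence) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (b, r) in enumerate(zip(sequence, reference)) if b != r]


def build_packet(routing, bio_header, sequence, reference):
    """Build a packet: first header (routing), second header (biological), payload."""
    first = json.dumps(routing).encode()
    second = json.dumps(bio_header).encode()
    payload = json.dumps(diff_against_reference(sequence, reference)).encode()
    # Hypothetical framing: three big-endian 32-bit lengths precede the three parts.
    lengths = struct.pack(">III", len(first), len(second), len(payload))
    return lengths + first + second + payload


def transmit(queue, send):
    """Drain the queue in FIFO order, handing each packet to a send callable."""
    while queue:
        send(queue.popleft())


# The queue stands in for the queue in communication with the network interface.
queue = deque()
queue.append(
    build_packet(
        {"dst": "node-b"},                         # hypothetical routing fields
        {"gene": "GENE_X", "reference": "ref-1"},  # hypothetical biological header
        "ACGTTCGA",
        "ACGTACGA",
    )
)

sent = []                     # stands in for the network
transmit(queue, sent.append)
```

Only the differences from the reference travel in the payload, which is what keeps the packet small when the sample closely matches the reference.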

In another aspect the disclosure pertains to a method of receiving biological sequence data. The method includes receiving, through a network interface of a network node, a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data. The method also includes providing the data packet to an input packet processor in communication with the network interface. At least the compressed version of the biological sequence data is then extracted from the data packet and stored within a memory of the network node.
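The receiving method may be illustrated, purely as a sketch, in the following manner. The packet framing, the use of zlib as the compression scheme, the `segment_id` header field and the dictionary standing in for node memory are all assumptions introduced for illustration.

```python
import json
import struct
import zlib


def make_demo_packet(sequence):
    """Build a demonstration packet (hypothetical framing: three 32-bit lengths)."""
    first = json.dumps({"dst": "node-b"}).encode()
    second = json.dumps({"segment_id": "seg-1"}).encode()
    payload = zlib.compress(sequence.encode())
    return struct.pack(">III", len(first), len(second), len(payload)) + first + second + payload


def parse_packet(packet):
    """Split a packet into routing header, biological header and compressed payload."""
    n1, n2, n3 = struct.unpack(">III", packet[:12])
    body = packet[12:]
    routing = json.loads(body[:n1])
    bio_header = json.loads(body[n1:n1 + n2])
    compressed = body[n1 + n2:n1 + n2 + n3]
    return routing, bio_header, compressed


# The dictionary stands in for the memory of the network node.
node_memory = {}
routing, bio_header, compressed = parse_packet(make_demo_packet("ACGT" * 8))
# The payload is stored still compressed, keyed by the hypothetical segment identifier.
node_memory[bio_header["segment_id"]] = compressed
```

Note that the input packet processor in this sketch never decompresses the payload; decompression is deferred until some later operation actually needs the bases.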

In a further aspect the disclosure pertains to a network node including a network interface and a packet generator in communication with the network interface. The packet generator is configured to generate a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence. The network node further includes a queue in communication with the network interface, the data packet being stored within the queue. In addition, the network node includes a transmit controller for controlling transmission of the data packet over a network accessible through the network interface.

In yet another aspect the disclosure pertains to a network node including a network interface and an input packet processor in communication with the network interface. The input packet processor is configured to receive a data packet and extract at least a compressed version of biological sequence data from the data packet, wherein the data packet includes a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing the compressed version of the biological sequence data. The network node further includes a memory in which is stored the compressed version of the biological sequence data.

BRIEF DESCRIPTION OF THE DRAWINGS

Various objects and advantages and a more complete understanding of the disclosure are apparent and more readily appreciated by reference to the following Detailed Description and to the appended claims when taken in conjunction with the accompanying Drawings wherein:

FIG. 1 is a representation of a biological data unit comprised of a payload containing DNA sequence data and a BioIntelligence header containing information having biological relevance to the DNA sequence data within the payload.

FIG. 2 illustratively represents a biological data model which includes a plurality of interrelated layers.

FIG. 3 depicts a biological data unit having a BioIntelligence header and a payload containing an instruction-based representation of segmented DNA sequence data.

FIG. 4 is a logical flow diagram of a process for segmentation of biological sequence data and combining the segments with metadata attributes to form biological data units encapsulated with BioIntelligence headers.

FIG. 5 depicts a biological data network comprised of representations of biological data linked and interrelated by an overlay network containing a plurality of network nodes.

FIG. 6 illustrates an exemplary protocol stack implemented at a network node together with corresponding layers of the OSI network model.

FIG. 7 shows a high-level view of various data types that may be processed by a group of network nodes in response to a query/request received from a client terminal.

FIG. 8 provides a block diagrammatic representation of the architecture of an exemplary network node.

FIG. 9A illustratively represents a process effected by a network node to implement a sequence variants processing procedure.

FIG. 9B is a flowchart of an exemplary variants processing procedure.

FIG. 10 illustratively represents the processing occurring at a network node configured to perform a specialized processing function.

FIG. 11 provides a representation of an exemplary processing platform capable of being configured to implement a network node.

FIG. 12 illustrates one manner in which data may be processed, managed and stored at an individual network node in an exemplary clinical environment.

FIGS. 13-18 illustratively represent the manner in which information within the layered data structure is utilized at an individual network processing node.

FIG. 19 depicts a Smart Repository™ configured to retrieve and aggregate genomic-related and other data relevant to the interests of actors interacting with the Smart Repository™.

FIG. 20 depicts a Smart Repository™ which includes a SmartTracker™ module and a transactor.

FIG. 21 illustrates an implementation of a Smart Repository™ which includes a SmartTracker™ module, a transactor and a transcriptor.

FIG. 22 is a flowchart representative of exemplary interaction between an actor and a Smart Repository™.

FIG. 23 illustrates an alternate implementation of a Smart Repository™ including a GeneTransfer Executive module and a transcriptor.

FIG. 24 depicts an exemplary implementation of an actor configured to interact as a client with a Smart Repository™.

FIG. 25A and FIG. 25B collectively provide a more detailed representation of an exemplary process performed by a Smart Repository™ in processing a request from an actor.

FIG. 26 is a flowchart representative of an exemplary process for ranking attributes appearing within relevant metadata files in order to identify a set of high-ranking attributes relative to a query and/or related query processing results.

FIG. 27 is a flowchart representative of an exemplary manner in which network nodes of a biological data network may cooperate to process a client request.

FIG. 28 is a flowchart representative of an exemplary sequence of operations involved in the identification and processing of sequence variants at a network node.

FIG. 29 is a flowchart representative of an exemplary sequence of operations carried out by network nodes of a biological data network in connection with processing of a disease-related query.

FIG. 30 is a flowchart representative of an exemplary sequence of operations involved in providing pharmacological response data in response to a user query concerning a specified disease.

FIG. 31 shows a flowchart representative of a manner in which information relating to various different layers of biologically-relevant data organized consistently with a biological data model may be processed at different network nodes.

DETAILED DESCRIPTION

Introduction

This disclosure relates generally to an innovative new biological data network and related methods capable of efficiently handling the massive quantities of DNA sequence data and related information expected to be produced as sequencing costs continue to decrease. The disclosed network and approaches permit such sequence data and related medical or other information to be efficiently stored in data containers provided at either a central location or distributed throughout a network, and facilitate the efficient network-based searching, transfer, processing, management and analysis of the stored information in a manner designed to meet the demands of specific applications.

The disclosed approaches permit such sequence data and any related medical, biological, referential or other information, be it computed, human-entered/directed or a combination thereof, to be efficiently transmitted and/or shared or otherwise conveyed from a centralized location or either partly or wholly distributed throughout the biological data network. These approaches also facilitate data formats and encodings used in the efficient processing, management and analysis of various “omics” (i.e., proto/onco/pharma) information. The innovative new biological data network or, equivalently, BioIntelligence network, is configured to operate with respect to biological data units stored at various network locations.

Each biological data unit will generally be comprised of one or more BioIntelligence headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest. The term header in this context refers to one or more pieces of information that have relevance to the payload, without regard to how or where such information is physically stored or represented within the BioIntelligence network. As is discussed below, it will be appreciated that certain operations performed by the nodes or elements of the biological data network may be effected with respect to the entirety of the biological data units undergoing processing; that is, with respect to representations of both the segmented sequence data and BioIntelligence headers of such biological data units.

However, the elements of the biological data network may perform other operations by, for example, comparing or correlating only the BioIntelligence headers of the biological data units being processed. In this way network bandwidth may be conserved by obviating the need for network transport of segmented biological sequence data, or some representation thereof, in connection with various processing operations involving biological units nominally stored at different network locations.
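The header-only mode of operation just described can be sketched as below. This is a minimal illustration assuming that headers are simple key/value attributes; the `BiologicalDataUnit` class, its fields and the `correlate` function are hypothetical names, not structures defined by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class BiologicalDataUnit:
    unit_id: str
    headers: dict           # BioIntelligence header attributes (assumed key/value)
    payload_location: str   # where the sequence payload resides; never fetched here


def headers_only(units):
    """Project units down to what would actually cross the network."""
    return [(u.unit_id, u.headers) for u in units]


def correlate(local_headers, remote_headers, key):
    """Pair local and remote units whose headers share the same value for `key`."""
    remote_by_value = {}
    for unit_id, headers in remote_headers:
        if key in headers:
            remote_by_value.setdefault(headers[key], []).append(unit_id)
    return [
        (unit_id, remote_id)
        for unit_id, headers in local_headers
        if key in headers
        for remote_id in remote_by_value.get(headers[key], [])
    ]


local_units = [
    BiologicalDataUnit("a1", {"gene": "GENE_X"}, "node-a:/segments/1"),
    BiologicalDataUnit("a2", {"gene": "GENE_Y"}, "node-a:/segments/2"),
]
remote_units = [
    BiologicalDataUnit("b1", {"gene": "GENE_X", "drug": "DRUG_1"}, "node-b:/segments/7"),
]

matches = correlate(headers_only(local_units), headers_only(remote_units), "gene")
```

The sequence payloads stay where they are stored; only the comparatively small header dictionaries are exchanged and compared.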

The biological data network may be comprised of a plurality of network nodes configured with processing and analytical capabilities, which are individually or collectively capable of responding to machine or user queries or requests for information. As is discussed below, the functionality of the new biological data network may be integrated into the current architectural framework of the Open Systems Interconnection (OSI) seven-layer model and the Transmission Control Protocol and Internet Protocol (TCP/IP) model for network and computing communications. This will allow service providers to configure existing network infrastructure to accommodate biological sequence data to deliver optimized quality of service for medical and health professionals practicing genomics-based personalized medicine. Alternatively or in addition, the new biological data network may be realized as an Internet-based overlay network capable of providing biological, medical and health-related intelligence to applications supported by the network.

The new biological data network facilitates overcoming the daunting challenges associated with analysis of various pertinent omics data types together with, and in the context of, all relevant, available prior knowledge. In this regard the new biological data network may facilitate development of an integrated ecosystem in which distributed databases are accessible on a network and in which the data stored therein is configured to be linked by BioIntelligence. This new biological data network may enable, for example, forming, securing, linking, searching, filtering, sorting, aggregating and connecting an individual's genome data with a layered data model of existing knowledge in order to facilitate extraction of new and meaningful information.

Overview of Biological Data Units and BioIntelligence Headers

As disclosed herein, the innovative new biological data network is configured to operate with respect to biological data units stored at various network locations. Biological data units can be considered as a set of information that is known or can be predicted to be associated with certain segments of genome sequences. Biological data units will generally be comprised of one or more BioIntelligence headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest.

The biological data units may be generated by dividing source DNA sequences into segments and associating one or more BioIntelligence headers (also referred to herein as “BI headers” or annotations or attributes) with one or more segments of genome sequence data. The various component parts of the header information contained in biological data units, for example XML metadata files, can be stored in distributed storage containers that are accessible on a network. Furthermore, the different segments of whole genome sequence data contained in the payloads of biological data units may be stored in multiple BAM files at various different locations on a network.
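As a non-limiting sketch of the generation step described above, a source sequence may be cut into fixed-length segments and each segment paired with the annotations that overlap it. Fixed-length segmentation, half-open coordinate intervals and the dictionary layout of a unit are assumptions of this sketch; the disclosure does not prescribe them.

```python
def segment_sequence(sequence, segment_length):
    """Split a sequence into (start, bases) segments of at most segment_length."""
    return [
        (start, sequence[start:start + segment_length])
        for start in range(0, len(sequence), segment_length)
    ]


def build_units(sequence, segment_length, annotations):
    """Attach to each segment the attributes of every annotation overlapping it.

    Each annotation is a dict with half-open 'start'/'end' coordinates and an
    'attributes' dict (hypothetical layout).
    """
    units = []
    for start, bases in segment_sequence(sequence, segment_length):
        end = start + len(bases)
        bi_headers = [
            a["attributes"]
            for a in annotations
            if a["start"] < end and a["end"] > start
        ]
        units.append({"start": start, "payload": bases, "bi_headers": bi_headers})
    return units


units = build_units(
    "ACGTACGTAC",
    4,
    [{"start": 3, "end": 5, "attributes": {"variant": "hypothetical-snp"}}],
)
```

An annotation that straddles a segment boundary, as in this usage example, is attached to both neighboring units, so that neither unit loses the association.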

Each BI header can be considered a specific piece of information or set of information that may be associated with or have biological relevance to one or more specific segments of DNA sequence data within the payload of the biological data unit. It should be appreciated that any information that is relevant to the segmented sequence data payload of a biological data unit can be placed in the one or more BioIntelligence headers of the data unit or, as is discussed below, within BioIntelligence headers of other biological data units. It should also be clearly understood that the information contained in any biological data unit can be highly distributed and network linked in such a manner that allows filtration and dynamic recombination of any permutation of associated attributes and sequence segments.

The BioIntelligence headers may be arranged in any order, whether dependent upon or independent of the payload data. However, in one embodiment the BioIntelligence headers are each respectively associated with at least one layer of a biological data model of existing knowledge that is representative of the biological sequence data which, for example, may be stored as BAM files within the payloads of the distributed biological data units with which such headers or XML metadata attributes are associated.

Although the present disclosure provides specific examples of the use of BI headers in the context of a layered data model, it should be understood that BI headers may be realized in essentially any form capable of embedding information within, or associating such information with, all or part of any biological or other polymeric sequence or plurality thereof. For example, one or more BI headers could be associated with any permutation of segments of DNA sequence or other such polymeric sequence or within any combination thereof, in any analog or digital format.

The BI headers could also be placed within a representation of associated polymeric sequence data, or could be otherwise associated with any electronic file or other electronic structure representative of molecular information. In other words, the one or more metadata attributes that are stored in multiple storage containers on a network may compose BioIntelligence headers that are specifically associated with at least one segment of sequence contained in a file transfer session.

In the case in which BioIntelligence data is embedded within DNA or other biological sequence information, the BI headers or tags including the BioIntelligence data may be placed in front of, behind or in any arbitrary position within any particular segmented sequence data or multiple segmented data sequences. In other words, in one particular embodiment of the invention, information that is associated directly or indirectly with the sequence may be stored within the base calls of reads that are contained in BAM files or any other sequence file format or internal memory structures, for example. This approach would involve a method for integrating at least one specific attribute of information that is associated with a genome sequence between and/or among the base calls contained within reads of sequence data files.
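One way to picture the embedding of tags among base calls is sketched below. The angle-bracket delimiter convention is purely an assumption of this sketch, chosen because such characters do not occur among base-call letters; it is not a feature of BAM or of any other real sequence file format, and the function names are hypothetical.

```python
import re

# Assumption: tags are delimited by angle brackets and contain no brackets themselves.
TAG = re.compile(r"<([^<>]*)>")


def embed(bases, tags):
    """Insert each tag text immediately before the base at its position.

    A tag keyed by len(bases) is placed behind the final base.
    """
    out = []
    for i, base in enumerate(bases):
        if i in tags:
            out.append(f"<{tags[i]}>")
        out.append(base)
    if len(bases) in tags:
        out.append(f"<{tags[len(bases)]}>")
    return "".join(out)


def extract(embedded):
    """Recover the plain bases and a {position: tag text} mapping."""
    pieces, tags, last, n_bases = [], {}, 0, 0
    for match in TAG.finditer(embedded):
        chunk = embedded[last:match.start()]
        pieces.append(chunk)
        n_bases += len(chunk)
        tags[n_bases] = match.group(1)
        last = match.end()
    pieces.append(embedded[last:])
    return "".join(pieces), tags


embedded = embed("ACGT", {0: "gene=GENE_X", 2: "snp", 4: "end"})
bases, tags = extract(embedded)
```

The usage example places one tag in front of the sequence, one at an interior position and one behind it, mirroring the three placements the text enumerates.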

In addition, the BioIntelligence data may be embedded in a contiguous or dispersed manner among and within the base calls of the segmented sequence data. When this highly structured and layered approach is applied to the storage configuration of this sequence data and associated information, it will advantageously facilitate the computationally efficient, effective and rapid analysis of, for example, the massive quantities of genome sequence data being generated by next-generation, high-throughput DNA sequencing machines.

In particular, distributed biological data units containing segmented DNA sequence data and associated attributes may be stored, sorted, filtered and operated on for various scopes and depths of analysis based upon the associated information contained within the BioIntelligence headers. This obviates the need to manipulate, transfer and otherwise breach the security of the segmented DNA sequence data in order to process and analyze such data.

One embodiment of the layered data model of the existing body of relevant knowledge includes not only BioIntelligence of or pertaining to biologically-relevant data but also other metadata which are associated with the nucleic acid sequence files. Such MetaIntelligence™ metadata may include, for example, facts, information, knowledge and prediction derived from biological, clinical, pharmacological, environmental, medical or other health-related data, including but not limited to other biological sequence data such as methylation sequence data as well as information on differential expression, alternative splicing, copy number variation and other related information.

The DNA sequence information included within the biological data units described herein may be obtained from a variety of sources. For example, DNA sequence information may be obtained “directly” from DNA sequencing apparatus, as well as from sequence data files that are stored in private and publicly accessible genome data repositories. Additionally, it may be computationally derived and/or manually gathered or inferred. In the case of the database of Genotypes and Phenotypes at the National Center for Biotechnology Information at the National Library of Medicine, the DNA sequence entries may be stored as BAM, SRF, fastq as well as in the FASTA format, which includes annotated information concerning the sequence data files. In one embodiment certain of the information contained within the one or more BioIntelligence headers of each biological data unit would be obtained from publicly accessible databases containing genome data sequences.

Turning now to FIG. 1, a representation is provided of a biological data unit comprised of a payload containing DNA sequence data and a BioIntelligence header containing information having biological relevance to the DNA sequence data within the payload. Furthermore, it should be appreciated that information contained in a particular BioIntelligence header may also point to or be associated with sequence data that is stored in at least one data container as the payload portion of biological data units.

In addition, it should be understood that the BioIntelligence header information and sequence payload that is contained within biological data units relate directly to attributes in XML metadata files and BAM sequence files, respectively. Any key value can associate with one or more sequence files or segments of sequence within such files. In one particular aspect of the disclosed approach, the key value may be information of or pertaining to a drug or its effect and the sequence may be a segment of sequence contained in a genomic sequence object file transfer session.
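The association between a key value and sequence files or segments could, for instance, be kept in a simple index such as the one sketched here. The `AttributeIndex` class, the (file, start, end) segment addressing and the example file names are hypothetical and serve only to illustrate the one-to-many association described above.

```python
from collections import defaultdict


class AttributeIndex:
    """Map a (key, value) attribute to the sequence segments it is associated with."""

    def __init__(self):
        self._index = defaultdict(set)

    def associate(self, key, value, sequence_file, start, end):
        """Record that the attribute applies to bases [start, end) of a file."""
        self._index[(key, value)].add((sequence_file, start, end))

    def lookup(self, key, value):
        """Return the associated segments in sorted order (empty if none)."""
        return sorted(self._index.get((key, value), ()))


index = AttributeIndex()
# Hypothetical example: one drug-response attribute linked to two segments.
index.associate("drug_response", "DRUG_1", "sample1.bam", 1000, 1500)
index.associate("drug_response", "DRUG_1", "sample2.bam", 200, 450)
segments = index.lookup("drug_response", "DRUG_1")
```

Because lookup returns only file names and coordinates, a query about a drug can identify the relevant segments without opening any sequence file.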

The BioIntelligence header information may associate with or relate to, for example, a microRNA sequence or the regulatory region of a gene or interaction with another gene product from at least one molecular pathway. Since the example that is presented as FIG. 1 shows that the payload contains DNA sequence data, the biological data unit of FIG. 1 may also be referred to herein as a DNA protocol data unit (DPDU). The DPDU can be considered as distributed biological data units that are encapsulated with information for transfer, control and other data that is relevant to the protocol.

In one embodiment, the exemplary biological data unit that is depicted in FIG. 1 would be associated with the DPDUs that are encapsulated and involved in a computer-implemented method for processing data units. As a further example, in the case where the sequence payload is RNA sequence data, such data, which may be derived from RNA-seq or deduced from the DNA sequence data, could be included within RNA protocol data units (RPDUs) comprised of a plurality of RNA-specific BioIntelligence headers and a payload comprised of the RNA sequence data. The BioIntelligence header information contained in distributed components of RPDUs may include but not be limited to information on differential expression, splicing, processing and other posttranscriptional modifications of RNA.

Similarly, a protein protocol data unit (PPDU) may be comprised of peptide-specific BioIntelligence headers and a payload containing a representation of amino acid sequence data. The biological sequence data that is contained in the payload of PPDUs may be from mass spectrometry protein sequencing data or deduced from the DNA sequence data of the DPDU of FIG. 1. Furthermore, the BioIntelligence header information may be information such as the protein's concentration in body fluids or the extent of protein activity, which could also be associated with the DPDU(s) of the representative gene.

Attention is now directed to FIG. 2, which illustrates a concept for a biological data model which is representative of the associations between and among layers of existing knowledge as well as the intra- and interrelationships that exist among and between the highly distributed biological data units described above. In particular, the BioIntelligence headers consisting of information pertaining to the DNA-specific, RNA-specific and peptide-specific biological data units are each associated with at least one of the “layers” of the biological data model of FIG. 2, i.e., the DNA, RNA and peptide layers, respectively.

Alternatively, a given biological data unit, which may be stored in multiple storage containers, may comprise a payload containing a representation of biological sequence data and a plurality of BioIntelligence headers, each of which is associated with one or more of the layers of the biological data model of FIG. 2. As is discussed below, although each BioIntelligence header may be characterized as being associated with a certain layer of a data model, each may also point to or otherwise reference information in the BioIntelligence header or payload of a separate biological data unit, which may itself be stored in multiple storage containers and may be associated with a different layer of the biological data model.
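The cross-layer referencing described above may be pictured with the following sketch, in which a header of one unit names a unit in another layer. The in-memory store, the `see_also` link attribute and the unit identifiers are assumptions introduced only for this illustration.

```python
# Hypothetical store of biological data units keyed by identifier.
UNITS = {
    "dpdu-1": {"layer": "DNA", "headers": {"gene": "GENE_X", "see_also": "rpdu-1"}},
    "rpdu-1": {"layer": "RNA", "headers": {"expression": "elevated", "see_also": "ppdu-1"}},
    "ppdu-1": {"layer": "peptide", "headers": {"activity": "reduced"}},
}


def follow_references(units, start_id, link_key="see_also"):
    """Walk header references from unit to unit, returning (unit_id, layer) pairs.

    Stops at a unit with no outgoing reference, at an unknown identifier,
    or when a reference would revisit a unit already seen.
    """
    chain, seen, current = [], set(), start_id
    while current is not None and current not in seen:
        unit = units.get(current)
        if unit is None:
            break
        seen.add(current)
        chain.append((current, unit["layer"]))
        current = unit["headers"].get(link_key)
    return chain


chain = follow_references(UNITS, "dpdu-1")
```

Here a query that starts at a DNA-layer unit reaches the RNA and peptide layers by following header references alone, without reading any payload.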

BioIntelligence headers may be associated with any form of intelligence or information capable of being represented as headers, tags or other parametric information which relates to the biological sequence data within the payload of a biological data unit. Alternatively or additionally, BioIntelligence headers may point to relevant or unique (or arbitrarily assigned for the processing purpose) information that is associated with the biological sequence data within the payload.

A BioIntelligence header may be associated with any information which is either known or predicted based upon scientific evidence, and may also serve as a placeholder for information which is currently unknown but which later may be discovered or otherwise become known. For example, such information may include any type of information related to the source biological sequence data including, for example, analytical or statistical information, testing-based data such as gene expression data from microarray analysis, theories or facts based on research and studies (either clinical or laboratory), or information from community-level or population-level studies or any related observation from the wild or nature.

In one embodiment relevant information concerning a certain segment of DNA sequence or biological sequence data may be considered metadata and could, for example, include clinical, pharmacological, phenotypic or environmental data capable of being embedded and stored in more than one storage container but with very close association with the sequence data as part of the payload or included within a look-up table.

One distinct advantage to storing metadata and sequence files in a manner that allows for effective and robust tracking and linking of the data is that it enables DNA and other biological sequences that make up large data files to be more efficiently processed and managed. The type of information that may be embedded or associated with segments of DNA sequences or any other biological, chemical or synthetic polymeric sequence can be represented in the form of packet headers, but any other format or method capable of representing this information in association with one or more segments of biological sequence data within a data unit is within the scope of the teachings presented herein.

The systems described herein are believed to be capable of facilitating real-time processing of biological sequence data and other related data such as, for example and without limitation, gene expression data, deletion analysis from comparative genomic hybridization, quantitative polymerase chain reaction, quantitative trait loci data, CpG island methylation analysis, alternative splice variants, microRNA analysis, SNP and copy number variation data as well as mass spectrometry data on related protein sequence and structure. Such real-time processing capability may enable a variety of applications including, for example, medical applications.

The types of medical applications that could be facilitated by this approach may include an automated computer-implemented algorithm that allows the storing, filtering, sorting and tracking of an individual's whole genome sequence in segments as they relate to all the attributes and annotations in association with a biological data model of existing knowledge, in order to extract meaningful and relevant results for specific queries. The processing and analysis of this data is expected to yield a new class of rich BioIntelligence information that can be utilized in accordance with the layered data model of prior knowledge.

BI headers may be used for the embedding of biologically relevant information, in full or in part, in combination with any polymeric sequence or part or combination thereof, and may be placed at either end of such polymeric sequence or in association within any combination of such polymeric sequences. In addition, embedded information can be considered to be information that is clustered and linked in such a way that relevant information related to the sequence data files is linked so as to allow meaningful new insight to be derived. Furthermore, the various components of the metadata information and sequence segments can be accessible from multiple storage containers on a network.

BI headers may be configured to be in any format and may be associated with one or more segments of polymeric sequence data. Furthermore, in certain cases the components of biological data units may be stored in a centralized container, and in such case the BI headers may be positioned in front of or behind (tail) the polymeric sequence data, or at any set of arbitrary locations within the representation of the segmented sequence data. Moreover, the BI headers may comprise contiguous strings of information or may themselves be segmented and the constituent segments placed (randomly or in accordance with a known pattern) among and between the segments of sequence data contained within one or more biological data units.

The use of BI headers in representing genome sequence data in a structured format advantageously provides an enhanced capability for classifying and filtering the sequence data based upon any of several stored existing knowledge fields that are related to that sequence segment. This approach allows for the sequence data to be sorted based on the abstracted descriptive information which is contained within the BI headers relating to the segmented sequence data of a specific biological data unit.

For example, the segmented genome sequence data represented by a plurality of biological data units could be processed such that a particular gene that is normally known to be located at a certain position on chromosome 1 could be sorted along with other genes or gene products from the same or a different chromosome if the corresponding genes or gene products are associated with a particular molecular pathway, drug treatment, health condition, diagnosis, disease or phenotype. Alternatively, it should be noted that certain chromosomal rearrangements could generate a similar result when a portion of one chromosome is transferred through translocation and becomes part of another.

In the general case, not all of the segments of DNA sequence data within the set of biological data units resulting from segmentation of an individual genome will directly associate with every field of the applicable BI header attributes. For example, a certain biological data unit may contain a segment of DNA sequence lacking an open reading frame, in which case the exon count field of the DNA-specific BI header would not be applicable. In any case, the particular header information type along with other header information types are maintained as placeholders for future scaling of the depth and scope of intelligence that is contained within the XML metadata files. This permits biological information relating to the segmented DNA sequence data of a certain biological data unit which is not yet known to be easily added to the appropriate layer of the biological data model once the information becomes known and, in certain cases, scientifically validated.

In certain exemplary embodiments disclosed herein, the biological or other polymeric sequence data contained within the payload of a biological data unit is represented in a two-bit binary format. However, it should be appreciated that other representations are within the scope of the teachings herein. For example, the instruction set architecture described in co-pending application Ser. No. 12/828,234 (the “'234 application”) may be employed in certain embodiments described herein to more efficiently represent and process the segmented genome sequence data within the payload of biological data units. Accordingly, in order to facilitate comprehension of these certain embodiments, a description is provided below of certain aspects of the instruction set architecture described in the '234 application.
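By way of illustration only, a two-bit binary representation of the kind mentioned above might be sketched as follows. The particular assignment of bit patterns to bases, the function names pack_bases and unpack_bases, and the convention of storing the base count separately are all assumptions made for this sketch and are not specified in the disclosure.

```python
# Hypothetical 2-bit code assignment; the disclosure does not say which
# bit pattern corresponds to which base.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so decoding can read fixed positions.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    # The padding bits of the last byte decode as extra bases; drop them.
    return "".join(bases[:length])
```

Because the final byte may be padded, the number of bases must accompany the packed payload (for example in a header field) for the round trip to be lossless.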

Overview of Instruction Set Architecture for Polymeric Sequence Processing

Set forth hereinafter are general descriptions of instruction set architectures comprised of instructions for processing biological sequences, as well as descriptions of associated biological sequence processing methods and apparatus configured to implement the instructions. The instructions may be recorded upon a computer storage medium, and a sequence processing system may contain the storage medium and a processing apparatus which can be configured to implement the processing and analysis that is defined by the set of instructions that are designed specifically for operating on the associated attributes. In addition, a computer data storage product may contain sequence data encoded using instruction-based encoding in order to generate a biologically relevant representation of the segmented genome sequence data.

Also described herein is an article of manufacture in a system for processing biopolymeric information, where the article of manufacture comprises a machine readable method for comparative sequence analysis which comprises an instruction set architecture that includes a plurality of instructions for execution by a processor, each of the plurality of instructions being at least implicitly defined relative to at least one controlled sequence, and representative of a biological, chemical, medical, pharmacological, clinical, environmental or physical event affecting one or more aspects of a biopolymeric molecule.

The plurality of instructions may include a set of operation codes corresponding to the biological event and an operand relating to at least a portion of a monomeric unit of the biopolymeric sequence. The one or more aspects may include a monomer of the biological polymer molecule. The event that affects the one or more aspects may include a structural representation of the biopolymeric molecule. The biopolymeric molecule may comprise a segment of a DNA molecule and the monomer may comprise at least a portion of a nucleotide base of the DNA sequence.

Genomic-Based Instructions

Herein, genomic sequences are defined as sequences of data that are handwritten or stored digitally and that describe the genomic characteristics of a particular organism. The term “genomic” in general will refer to sequence data that encode genes (also referred to as “genetic” data) as well as data that is believed to be non-coding.

The phrase “a particular organism” will mean the organism from which cells were used to prepare DNA for sequencing. Cells will refer to any and all cell types that are integral to the particular organism, including normal cells and tumor cells, as well as cells from plants and animals that may be in the digestive tract of the organism. Furthermore, this will include bacteria, viruses and mobile DNA elements that are attached to the organism on the outside or inside. The terms “bacteria” and “viruses” will refer specifically to detection of any evidence of these microbial organisms' DNA sequences, which may be endogenous or exogenous.

The term “genome” will refer to an organism's entire hereditary information. Genomic sequencing is the process of determining a particular organism's genomic sequence. This term will further reference an organism's inheritable “genome”, which will include methylation sequencing epigenomics data as well as microbiomics data and known or predicted non-Mendelian trans-generational transmission of RNA sequence data.

The human genome, as well as that of other organisms, can be generally thought of as being made of four chemical units called nucleotide bases (also referred to herein as “bases” for brevity). These bases are adenine (A), thymine (T), guanine (G) and cytosine (C). Double stranded sequences are made of paired nucleotide bases, where each base in one strand normally pairs with a base in the other strand according to the Watson-Crick pairing rules, i.e., A pairs with T and C pairs with G (in RNA, thymine is replaced with uracil (U), which pairs with A and less often with G).

A sequence is a series of bases, ordered as they are arranged in molecular DNA or RNA. For example, a sequence may include a series of bases arranged in a particular order, such as the following example sequence fragment: ACGCCGTAACGGGTAATTCA. The human haploid genome contains approximately 3.2 billion base pairs, which may be further broken down into a set of 23 chromosomes. It is approximated that the 23 chromosomes encode about 30,000 genes. While each individual's sequence is different, there is much redundancy between individuals of a particular genome, and in many cases there is also much redundancy across similar species. For example, in the human genome the sequences of two individuals are about 99.5% equivalent, and are therefore highly redundant. Viewed in another way, the number of differences in bases in sequences of different individuals is correspondingly small. These differences may include differences in the particular nucleotide at a position in the sequence, also known as a single nucleotide polymorphism or SNP, as well as addition, subtraction, rearrangement or repetition of nucleotides, or any genetic or epigenetic variation of nucleotides, between individuals' sequences at corresponding positions in the sequences.

Because of the enormous size of the human genome, as well as the genomes of many other organisms, storing and processing genomic sequences (which are typically separate sequences generated from a particular individual or organism, but may also be a sequence fragment, sub-sequence, sequence of a particular gene coding sequence or non-coding sequences between genes, etc.) creates problems with processing, analysis, memory storage, data transmission, and networking. Consequently, it is usually beneficial to store the sequences in as little space as possible. At present, there are several well recognized efforts to achieve the smallest possible storage footprint. Moreover, it is typically important that no information is lost in storage and transmission. Accordingly, processing for storage or transmission of whole or partial sequences should include removing redundant information in a sequence in a lossless fashion.

Variations in the DNA sequences of different individuals are a result of deviations (also known as mutations). For example, one particular type of mutation may relate specifically to substitutions of nucleotide bases at common or certain reference positions in the sequence. A base substitution (also known as a point mutation) is the result of one base in a sequence at a particular position or reference location being replaced with a different one (relative to another sequence, which may be a reference sequence against which other sequences are compared).
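As a simplified illustration of describing a sequence relative to a reference, the sketch below records only the base substitutions that distinguish a sample from an equal-length reference and then reconstructs the sample losslessly from them. The function names and the (position, base) tuple layout are hypothetical, and insertions, deletions and rearrangements are deliberately out of scope; this is not the instruction set of the '234 application.

```python
from typing import List, Tuple

# (0-based position in the reference, base observed in the sample)
Substitution = Tuple[int, str]


def diff_substitutions(reference: str, sample: str) -> List[Substitution]:
    """List the positions at which the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (pos, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def apply_substitutions(reference: str, subs: List[Substitution]) -> str:
    """Rebuild the sample from the reference plus its substitutions."""
    bases = list(reference)
    for pos, base in subs:
        bases[pos] = base
    return "".join(bases)
```

When two sequences are highly similar, the list of substitutions is far shorter than the sequence itself, which is the redundancy that a reference-relative payload is intended to exploit.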

A base substitution can be either a transition (e.g., between G and A, or C and T) or a transversion (e.g., between G and its paired base C or a T, or between A and its paired base T or a C).
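The distinction above can be expressed compactly: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any substitution between the two groups is a transversion. The following sketch is illustrative; the function name classify_substitution is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref_base: str, alt_base: str) -> str:
    """Return 'transition' or 'transversion' for a DNA base substitution."""
    valid = PURINES | PYRIMIDINES
    if ref_base not in valid or alt_base not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref_base == alt_base:
        raise ValueError("identical bases are not a substitution")
    # Same chemical class on both sides means a transition.
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"
```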

Representation of Polymeric Sequence Data Using Biological Data Units

One aspect of the present disclosure describes an innovative methodology for biological sequence manipulation well-suited to address the difficulties that are related to the processing and comparative sequence analysis of large quantities of DNA sequence data. The disclosed methodologies enable segmented representations of such sequence data to be efficiently stored (either locally or in a distributed fashion), searched, moved, processed, managed and analyzed in an optimal manner in light of the demands of specific applications.

The disclosed method involves breaking whole genome DNA sequence entries into deliberate segments and packetizing the fragments in association with BioIntelligence header information to form biological data units. In one embodiment, much of the BioIntelligence header information may be obtained from private or public databases containing information pertaining to involved molecular pathways, from drug databases, and from published research data that can be found in well-established databases such as, for example, dbGaP and EMBL. The DNA sequence entries within many public databases may be stored in a BAM file format, which accommodates the inclusion of annotated information concerning the sequence. For example, an entry for a DNA sequence recorded in the BAM file format could include annotated information identifying the name of the organism from which the DNA was isolated and the gene or genes contained in the specific sequence entry.

Alternatively, the sequence file may contain the base sequence information while the ancillary metadata information could be contained in XML files as specific attributes that are associated with a particular segment of the sequence. The associated information that is contained in these files may relate to prior knowledge that is configured in a biological model that is consistent with a layered data model.

In addition, information identifying the chromosome from which the particular DNA sequence segment was obtained, and the starting and ending base positions of the sequence, would also typically be available. Furthermore, other public and private databases include information relating to, for example, the location of human CpG islands and their methylation sequence, as well as the genes with which such islands are associated (see, e.g., http://data.microarrays.ca/cpg/index.htm).

For each identifiable gene there will be an essential need for a normal control state of the particular gene. Database entries that contain genes identified as being associated with a RefSeqGene, which pertains to a project within NCBI's Reference Sequence (RefSeq) project, provide another potential source of BioIntelligence header information. The RefSeqGene project, which is a part of the Locus Reference Genomic (LRG) project, defines the DNA sequences of genes that are well-characterized by leaders in the scientific community to be used as reference standards. In particular, sequences labeled with the keyword RefSeqGene serve as a stable foundation for reporting mutations, for establishing conventions for numbering exons and introns, and for defining the coordinates of other biologically significant variation. DNA sequence entries that associate directly with the RefSeqGene will be well-supported, exist in nature, and, to the extent possible, represent a prevalent, ‘normal’ allele.

It should be appreciated that there may be different schemas for segmentation and packetizing sequence entries in order to associate the highly relevant attribute information with specific sequence segments. For example, in the case in which it is suitable to segment sequence entries into packets containing genes or, alternatively, into introns and exons, relevant data is available for placement into the BioIntelligence header information relating to the metadata attributes of the biological data units containing such sequence segments.

Biological Data Units Including BioIntelligence Headers

Referring again to FIG. 1, the BioIntelligence header 110 is seen to include a number of fields containing information of biological relevance to the DNA sequence data within the payload 120 of the biological data unit 100. The information that is contained within the BioIntelligence header may be stored in multiple containers on a biological data network. See, e.g., FIG. 5.

In one approach, biological data units are created at least in part by specifically linking information from XML metadata files with particular segments of BAM file sequence data. In this case, the biological data units can be considered units of information having a certain relationship, which can be stored at, or streamed from and to, multiple nodes on a network. In this case the information that is contained within the BI header is distributed and is able to link with sequence segments specifically. The protocols used for the transmission of these precisely related clusters of information in biological data units are integrated with a computer-implemented program that defines and classifies the links between and among the BioIntelligence header information and the segment of sequence payload.

It should be appreciated that FIG. 1 provides only one specific exemplary representation of the type of biologically relevant information which may be included within a BioIntelligence header of distributed biological data units. Accordingly, including other types of relevant attributes and information within a BioIntelligence header or the equivalent, regardless of how the data is represented or configured, is believed to be within the scope of the present disclosure.

In addition, although the following generally describes information as being contained or included within various sections of the BioIntelligence header 110, it should be understood that in various embodiments such headers may be distributed and may contain pointers, tags or links to other structures or memory locations storing the associated header information.

Similarly, the payload 120 may contain a representation of the segmented DNA sequence data of interest, or may include one or more pointers or links to other structures or locations containing a representation of such sequence data. In this case, the various segments of a particular whole genome sequence may be stored in a distributed manner in multiple containers that are accessible on a network.

A first section 101 of the BioIntelligence header 110 provides information concerning CpG methylation sequence data that pertains to the various positions of the DNA sequence segment within the payload 120 of the biological data unit 100. In other words, the information that is contained in the ancillary files that are associated with the sequence points to section 101. Identification of these CpG islands and the methylation sequence will likely play an important role in understanding regulation of the associated genes and any involvement with disease.

The header information that is contained in the header 110 also includes a chromosome banding pattern property in section 102, which contains information concerning any chromosomal rearrangement that is observed, known, as yet unknown, or predicted to be involved with at least one segment of genome sequence data linked to this attribute. These types of cytogenetic abnormalities are often associated with severe phenotypic effects. This information may be configured in any other format to represent the genomic effects of chromosomal rearrangements, which are known to be common in cancer tumor genomics.

Header sections 103 and 104 provide information identifying the beginning and ending positions for the exons that are contained in the DNA sequence segment included within the payload 120. In the case of whole exome sequencing, this information represents exons throughout the whole genome that are expressed in genes. Since exon selection has tissue and cell type specificity, these positions may differ among the various cell types as a result of a splice variant or alternative splicing. Along with this DNA coding information for individual exons, header section 105 may represent information in a metadata file giving a count of the number of exons contained in the DNA sequence segment included within the payload 120. This type of information is known to be relevant in disorders involving exon skipping and exon duplication.

Attribute information linked specifically with one or more DNA sequence segments within the payload 120 having some association with a disease will be represented within section 106. Information pertaining to certain known molecular pathways or systems that may have molecular interactions with other genes or gene products would also be described within this section of the BI header. Alternatively, since variations of a given gene could be involved in one or more diseases, such information would also generally be contained within header section 106.

To the extent the DNA sequence segment in the payload 120 contains a part of a gene, a gene or a plurality of genes, the header section 107 provides all of the pertinent information that relates specifically to the applicable known gene name or gene ID. Header section 108 may represent the type of information that specifies the tissue or cell type which may be relevant to the extent and level of expression of the various exons that may be encoded in the gene or segment of genome that is described in section 105.

The metadata attribute located in the header section 109 will provide information concerning all possible open reading frames present within the segment of genome sequence data that is contained within the payload 120. This type of BioIntelligence attribute will be crucial for characterizing disease-associated variants which are contained within what appear to be open reading frames that express no proteins or peptides that are detectable with today's methods.

Header sections 110 and 111 represent the metadata annotations that specify the start and end positions of the DNA sequence segment that is linked to a specific segment of a BAM file, represented by the payload 120. These positions may be considered arbitrary since the positions in the sequence could be expressed relative to more than one reference sequence.

Section 112 indicates whether the segmented DNA sequence data within the payload 120 is chromosomal, microbial or mitochondrial. Furthermore, section 113 provides information concerning the genus and species of the origin of the DNA sequence segment represented within the payload 120. It should be appreciated that sections 112 and 113 will provide the information that describes all the DNA sequence data that is associated with an individual, including but not limited to microbes attached on the outside and found on the inside of said individual as well as genome sequence data from plants and other higher animals found in the digestive tract.
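For purposes of illustration, the header sections described above could be modeled as a record in which every field is optional, so that attributes which are not applicable or not yet known remain as placeholders. The class names, field names and field types below are hypothetical choices made for this sketch; only the correspondence to the numbered sections follows the description above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class BioIntelligenceHeader:
    """Hypothetical field names; comments give the section each one mirrors."""
    cpg_methylation: Optional[str] = None          # section 101
    chromosome_banding: Optional[str] = None       # section 102
    exon_starts: Optional[List[int]] = None        # section 103
    exon_ends: Optional[List[int]] = None          # section 104
    exon_count: Optional[int] = None               # section 105
    disease: Optional[str] = None                  # section 106
    gene_id: Optional[str] = None                  # section 107
    tissue_or_cell_type: Optional[str] = None      # section 108
    open_reading_frames: Optional[List[Tuple[int, int]]] = None  # section 109
    start_pos: Optional[int] = None                # section 110
    end_pos: Optional[int] = None                  # section 111
    dna_type: Optional[str] = None                 # section 112
    genus_species: Optional[str] = None            # section 113


@dataclass
class BiologicalDataUnit:
    header: BioIntelligenceHeader
    payload: bytes  # a sequence representation, or a pointer to one


def placeholder_fields(header: BioIntelligenceHeader) -> List[str]:
    """Names of header fields still acting as placeholders (value None)."""
    return [f.name for f in fields(header) if getattr(header, f.name) is None]
```

A unit created with only a gene identifier and species, for instance, would report its remaining fields through placeholder_fields, and those fields could be filled in later as the corresponding information becomes known.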

All of the metadata annotations and attributes that are within the header 110 will generally contain prior knowledge information relating to the BioIntelligence that is relevant to the DNA sequence, which is functionally utilized while the data is being sorted, filtered and processed. This packetized structure of the DNA sequence data that is represented in bits and encapsulated with BioIntelligence headers and other relevant information advantageously facilitates processing by existing network elements operative in accordance with layered or stacked protocol architectures.

For example, The Cancer Genome Atlas consortium has elected to implement biological data units comprised of BioIntelligence headers consisting of information contained in XML metadata files and payloads comprised of genome sequence data contained in the BAM files. In this exemplary implementation a first specific type of BioIntelligence information may reference the tissue type or cell type of the sequence files (section 108 of FIG. 1). Similarly, a second specific type of BioIntelligence information may reference a disease type (section 106 of FIG. 1).

Attention is now directed to FIG. 3, which depicts a biological data unit 300 having a BioIntelligence header 310 and a payload 320 containing an instruction-based representation of segmented DNA sequence data. The type of information that is illustrated in the header 310 is exemplary. Moreover, this information may be stored in one or more storage containers that are accessible on a network. The instruction-based representation is discussed above and in the copending '234 application. Although the content and representations of the payloads 120 and 320 differ, the same type of information is included within the BioIntelligence headers 110 and 310 of the biological data units 100 and 300, respectively.

The distributed packetizing of segmented DNA sequence data files and the embedding of biologically and clinically relevant information in biological data units will enable development of a networked processing architecture within which such data may be organized and configured in a layered format. Based on preliminary results, the architecture is expected to be particularly suited for effecting rapid analysis of large amounts of data of this type.

In one approach, the header contained within such biological data units is used to qualify or characterize the fragmented or otherwise segmented genome sequence data included within the payloads of such data units. In so doing, biological data units containing segmented DNA sequence data or other sequence data may be sorted, filtered and operated upon based on the associated attribute information contained within the ancillary metadata files of the highly distributed data units.

For example, a data repository containing biological data units incorporating segmented DNA sequence data and related attribute information similar to that associated with the header 110 of FIG. 1 may be quickly and efficiently sorted in accordance with parameters defined by an application. This has recently been demonstrated by a system that has reduced to practice the concepts and ideas of the current disclosure, namely the repository now known as the Cancer Genome Hub (CGHub) operated by the University of California. In other words, the same segments of genome sequence may be sorted and analyzed in several different ways by using the header information associated with, or otherwise directly or indirectly linked to, the payload representation of the sequence segments.

It is expected that it would be beneficial to arrange and represent all of the genomic sequence information from an individual organism, whether a bacterium, an animal, a plant or a human, in accordance with the layered data architecture illustrated in FIG. 2. For example, consider the case in which a segment of a genome sequence data file of interest is included as the payload of a biological data unit stored in a data container which includes biological data units associated with DNA sequence data of other organisms.

Consider further that if, for example, the DNA sequence data of interest is a particular variant of a human gene associated with breast cancer, such as BRCA1, then such data could be extracted from the container by filtering the contents of the data container for metadata attributes associated specifically with segments of DNA sequence data from the organism Homo sapiens. The data units containing the specific BRCA1 variant, along with all other data units containing human DNA sequence data, may then be easily extracted. However, separating human DNA sequence data from the DNA sequence data of other organisms may not be sufficiently selective in view of the technical requirements of certain applications. Accordingly, additional processing and comparative analysis may be performed in which specific data units comprising certain segments of sequence data from human chromosome 17 would be filtered from the data container.

Biological data units having payloads containing DNA sequence segments from chromosome 17 may provide a reasonable level of filtering. However, in order to efficiently analyze the gene most notably associated with breast cancer, further processing, sorting and filtering will be necessary. This may be achieved using several methods including, but not limited to, filtering on the specific start and end positions within the chromosome (S pos and E pos), on the gene ID (GID), or on the disease (breast cancer). If the biological data units that are being sorted contain sequence segment data associated with an alternately-spliced variant of BRCA1, then this information may be contained in the header information representing the total exon count (see, e.g., header section 105 of FIG. 1), in addition to the header sections including start exon and end exon information (see, e.g., header sections 103 and 104). Furthermore, additional information concerning tissue or cell type may need to be provided in order to perform the most intricate level of sorting and filtering of the biological data units associated with a specific BRCA1 variant.
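The successive narrowing described above (by organism, then by chromosome, then by gene ID) can be sketched as repeated filtering on header attributes. In the sketch below, the dictionary layout, the key names (including "chromosome", which is not a numbered header section in FIG. 1) and the function filter_units are assumptions for illustration, and the position values are placeholders rather than real genomic coordinates.

```python
from typing import Any, Dict, List

Unit = Dict[str, Any]


def filter_units(units: List[Unit], **criteria: Any) -> List[Unit]:
    """Keep units whose header matches every key=value criterion given."""
    return [
        unit for unit in units
        if all(unit["header"].get(key) == value for key, value in criteria.items())
    ]


# Illustrative container; start_pos/end_pos are made-up placeholder values.
units: List[Unit] = [
    {"header": {"genus_species": "Homo sapiens", "chromosome": "17",
                "gene_id": "BRCA1", "start_pos": 1000, "end_pos": 2000},
     "payload": b""},
    {"header": {"genus_species": "Homo sapiens", "chromosome": "17",
                "gene_id": "TP53", "start_pos": 5000, "end_pos": 6000},
     "payload": b""},
    {"header": {"genus_species": "Homo sapiens", "chromosome": "13",
                "gene_id": "BRCA2", "start_pos": 3000, "end_pos": 4000},
     "payload": b""},
    {"header": {"genus_species": "Escherichia coli", "gene_id": "lacZ"},
     "payload": b""},
]

human = filter_units(units, genus_species="Homo sapiens")  # organism level
chr17 = filter_units(human, chromosome="17")               # chromosome level
brca1 = filter_units(chr17, gene_id="BRCA1")               # gene level
```

Because the selection operates only on header attributes, the payloads never need to be decoded during sorting and filtering; the same container could equally be filtered by disease or by tissue type where those attributes are populated.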

The packetized structural configuration of the disclosed distributed biological data units further enables functional integration of layered data models such as that depicted in FIG. 2. In particular, each metadata attribute of the BioIntelligence headers forming at least a part of, or linked to, a particular biological data unit may be associated with one or more specific layers of the model. One advantage of using a layered data model is that data from the various layers may interrelate during processing of the header information included within the set of biological data units being operated on or otherwise analyzed. For example, in the exemplary case described above, information from the RNA layer of the model relating to the splicing of introns from pre-mRNA was used to identify BRCA splice variants, thereby facilitating correct determination of exon start and end positions.

The use of BioIntelligence header information which is consistent with a layered data architecture also advantageously enables substantial changes to be made to the information associated with one layer of the model without necessitating that corresponding modifications be made to other layers of the model. For example, sequence variants may be observed at splice donor and splice acceptor sites which may change the splicing pattern and mRNA size, protein structure and function, and these changes may yet be accommodated and mapped to the DNA layer without requiring that corresponding changes be made to the DNA layer of the existing knowledge data model.

Attention is now directed to FIG. 4, which provides a logical flow diagram of a process 400 for segmentation of biological sequence data and combining the segments with metadata attributes to form biological data units encapsulated with BioIntelligence headers. The process 400 provides one example of a way in which source DNA sequence data may be fragmented to generate biological data units containing DNA sequence segments and associated BioIntelligence header information in accordance with a layered data model such as the biological data model 200.

In one embodiment, the process 400 utilizes sequence feature information of the type annotated in well-established nucleotide databases 410 such as, for example, NCBI, EMBL and DDBJ for sorting, configuring and operating on the sequence data. By mapping the biological information within these databases into various layers of BioIntelligence header information, a layered data model of existing knowledge can be constructed.

Referring to FIG. 4, human genomic DNA data is shown to be accessible from different storage elements 410. In this regard, the DNA sequence data can be stored in segments as sequences of individual chromosomes or partial chromosomes or as individual genes, and may comprise all or part of a genome. In addition, the DNA sequence data could be generated from a sequencing machine and the results made accessible to a network of computers. Further, genomic sequence data might be represented in any file format and produced using any approach including, for example, as a partial dipolar charge and phosphorescence sequence profile indicative of the sequence data.

In a stage 420, the sequence data obtained from storage elements 410 is mapped and aligned with the reference genomic sequence data. The DNA sequence is associated with a set of relevant molecular features using, for example, biological data 414 deemed valid by the scientific community. This data 414 is mapped to specific regions of a sequence entry. In addition, clinical and pharmacological data 416 demonstrated to be associated with any coding or non-coding regions of a sequence entry is also mapped.

In one embodiment layer-1 biological data units 444₁ include a payload comprised of segmented DNA sequence data and a DNA layer header. Similarly, layer-2 biological data units 444₂ may include a payload comprised of segmented DNA sequence data, a DNA layer header and an RNA layer header. A layer-N biological data unit 444_N may include a payload comprised of segmented DNA sequence data, a DNA layer header, an RNA layer header, and other headers associated with higher layers of the relevant data model.

Alternatively, in one embodiment layer-1 biological data units 444₁ may include a payload comprised of segmented DNA sequence data and a DNA layer header, layer-2 biological data units 444₂ may be comprised of segmented RNA sequence data and an RNA layer header, and so on. In one embodiment a base unit may be prepended to or otherwise associated with each biological data unit in order to identify the specific headers included within the data unit and/or the number thereof.
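The layered data units and the base unit described above can be pictured with a small data structure. The following is only an illustrative sketch and not the disclosed implementation: the names `LayerHeader`, `BiologicalDataUnit`, `base_unit` and `make_layer_n_unit`, the header field names, and the assignment of layer 3 to proteins are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical layer numbering. The text associates layer 1 with DNA and
# layer 2 with RNA; treating layer 3 as the protein layer is an assumption.
LAYER_NAMES = {1: "DNA", 2: "RNA", 3: "protein"}


@dataclass
class LayerHeader:
    layer: int      # layer of the data model this header belongs to
    fields: dict    # header content; the field names are illustrative only


@dataclass
class BiologicalDataUnit:
    payload: str                                  # segmented sequence data
    headers: list = field(default_factory=list)   # LayerHeader objects

    def base_unit(self):
        """Identify which headers the unit carries and how many there are."""
        return {
            "layers": [h.layer for h in self.headers],
            "count": len(self.headers),
        }


def make_layer_n_unit(segment, fields_by_layer, n):
    """Build a layer-n unit carrying one header for each of layers 1..n."""
    headers = [LayerHeader(k, fields_by_layer.get(k, {})) for k in range(1, n + 1)]
    return BiologicalDataUnit(payload=segment, headers=headers)


unit = make_layer_n_unit("ACGTTGCA", {1: {"chromosome": "17"}}, n=2)
print(unit.base_unit())
```

In this sketch the base unit is computed from the headers rather than stored, which keeps it consistent if headers are later added or removed; a real encoding could equally prepend a fixed-size descriptor.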

In one embodiment BioIntelligence headers 424 may include physical, chemical, or biological knowledge or findings, or any related molecular data that has been peer reviewed, published and accepted as valid. BioIntelligence headers 424 may also include clinical, pharmacological and environmental data, as well as data from gene expression and methylation.

In certain embodiments BioIntelligence headers 424 may further include information relating to gene and gene product interaction with other components of a pathway or related pathways. The information within BioIntelligence headers 424 may also be obtained from, for example, microarray studies, copy number variation data, SNP data, complete genome hybridization, PCR and other related techniques, data types and studies.

The prior scientific knowledge and information associated with a specific sequence and included within a BioIntelligence header 424 may be of several different types including, for example, molecular biological, clinical, medical and pharmacological information. In this regard such molecular and biological information could be separated and layered based on data from, for example, genomics, exomics, epigenomics, transcriptomics, proteomics, and metabolomics in order to yield BioIntelligence data.

The BioIntelligence data may also include DNA mutation data, splicing and alternative splicing data, as well as data relating to posttranscriptional control (including microRNA and other non-coding silencing RNA and other nuclease degradation pathways). Mass spectrometric data on protein structure and function, mutant protein products with reduced or null function, as well as toxic products could also be utilized as BioIntelligence information.

In addition, pharmacological and clinical data relating to specific genes or gene regions disposed to exert effects through interaction with gene products or other components of a pathway could be considered as a class of BioIntelligence header information. Finally, BioIntelligence header information could also include environmental conditions or effects correlated with certain genes or gene products known or predicted to be related to a certain phenotypic effect or disease onset.

As mentioned above, during stage 440 BioIntelligence headers 424 are associated with segmented DNA sequence data to form biological data units comprised of a BioIntelligence header 424 encapsulating a payload containing the segmented DNA sequence data. In this process the association of a BioIntelligence header 424 with a payload containing segmented genome sequence data may be carried out in any of a number of ways. For example, such association may be effected using a pointer table, tag, graph, dictionary structure, key-value store or by embedding header information directly into the segmented sequence data.

In a stage 460, the biological data units 444 may be organized into encapsulated data units in accordance with the requirements of particular applications. For example, in certain cases it may be desired to create encapsulated biological data units including only a subset of the headers which would otherwise be included in the biological data units associated with at least one particular layer of the biological data model of prior knowledge. For example, a certain application may require encapsulated biological data units having headers associated with only layers 1, 2 and 5 of a data model.

Another application may require, for example, encapsulated biological data units having headers associated with only layers 2, 3 and 4 of the data model. Similarly, other applications may require that the headers of the encapsulated biological data units be arranged in a particular order, e.g., the header for layer 4, followed by the header for layer 1, followed by the header for layer 2.
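The application-specific selection and ordering of headers in stage 460 can be sketched as follows. This is a hedged illustration only: the function name `encapsulate`, the dictionary layout, and the header field names are hypothetical and are not taken from the disclosure.

```python
def encapsulate(payload, headers_by_layer, layer_order):
    """Return a unit carrying only the requested headers, in the requested order.

    headers_by_layer maps a layer number to that layer's header content.
    layer_order lists the layers an application wants, in the order it wants them.
    """
    missing = [layer for layer in layer_order if layer not in headers_by_layer]
    if missing:
        raise KeyError(f"no header available for layers {missing}")
    return {
        "payload": payload,
        "headers": [(layer, headers_by_layer[layer]) for layer in layer_order],
    }


# Illustrative header content for a five-layer model (field names are assumed).
all_headers = {
    1: {"kind": "DNA"},
    2: {"kind": "RNA"},
    3: {"kind": "protein"},
    4: {"kind": "clinical"},
    5: {"kind": "pharmacological"},
}

subset = encapsulate("ACGT", all_headers, [1, 2, 5])       # layers 1, 2 and 5 only
reordered = encapsulate("ACGT", all_headers, [4, 1, 2])    # layer 4, then 1, then 2
print([layer for layer, _ in reordered["headers"]])
```

The sketch raises an error when a requested layer has no header; an actual system might instead substitute an empty header or skip the layer.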

In a stage 480, the encapsulated biological data units created in stage 460 are stored in a manner consistent with being interoperable with one or more multi-layered, multi-dimensional data containers 464. The content of the headers of the encapsulated biological data units is chosen to promote optimal interoperability among and between layers. For example, in one simplified case each biological data unit included within the data container 464₁ may include at least a DNA layer header, an RNA layer header, and a protein layer header. It is a feature of the present system that information within higher-layer headers (e.g., RNA layer headers or protein layer headers) may be “mapped” to lower-layer headers and/or sequence information in such a way as to establish a relationship provenance between information within various layers.

Consider an example wherein data concerning a particular protein product that is expressed in a certain tissue type (i.e., protein layer information) may also provide information relating to splicing (i.e., RNA layer information) or to a SNP at the genomic level (i.e., DNA layer information) resulting in a premature termination codon. In other words, protein structure related data can provide RNA level knowledge on alternative splicing as well as primary sequence data on amino acid substitutions revealing SNPs and indels in the DNA sequence.

In another case, the diagnosis of a certain disease in a certain patient or, for example, results from a mammogram screen or prostate-specific antigen results, may provide information that is directly related to hyper-methylation of certain regions of the DNA sequence segment included within a DNA layer biological data unit. These epigenetic markers, along with the methylation profile at CpG islands associated with certain genes, could provide crucial BioIntelligence header information to relate and correlate with appropriate gene and disease conditions.

One advantage of the layered architecture of the data containers 464 is that modification or updating of the data content associated with a given layer has minimal or no effect on the processing of data in the remaining layers. In one embodiment layers are advantageously designed to be operated on independently while retaining the capability to integrate, and interoperate with, data and existing knowledge of other layers. In addition, data can be organized within each data container 464 in accordance with the requirements of specific applications.

All or part of this data may be mapped, via linked relationships between information within BioIntelligence headers or metadata attributes that are associated with different layers of a data model, to a disease condition capable of being associated with a region of segmented DNA sequence data contained within a biological data unit. This enables biological data units to be grouped and analyzed based upon the classification schema required by a particular application.

In a stage 490, biological data units encapsulated with BioIntelligence headers and stored within the data containers 464 may subsequently be filtered, sorted or operated upon based on information included within such headers. The layered structure of biological data units encapsulated with BioIntelligence headers enables querying of the information included within one or more such headers to be performed and results returned based upon a set of rules specified by, for example, the application issuing the query.
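Header-based filtering and sorting of the kind performed in stage 490 might look like the sketch below, in which an application supplies its rules as predicates over a unit's headers. The function `query_units`, the header layout, and all field names and values are assumptions introduced for illustration; note that the payload is never inspected.

```python
def query_units(units, rules, sort_key=None):
    """Return the units whose headers satisfy every rule, optionally sorted.

    Each rule is a callable that receives the unit's headers (a dict keyed by
    layer name) and returns True or False. Only headers are examined.
    """
    matches = [u for u in units if all(rule(u["headers"]) for rule in rules)]
    if sort_key is not None:
        matches.sort(key=lambda u: sort_key(u["headers"]))
    return matches


# Illustrative units; the header fields and values are made up for the example.
units = [
    {"payload": "ACGT", "headers": {"DNA": {"chromosome": "17", "start": 500},
                                    "protein": {"tissue": "breast"}}},
    {"payload": "GGCA", "headers": {"DNA": {"chromosome": "17", "start": 100},
                                    "protein": {"tissue": "liver"}}},
    {"payload": "TTAG", "headers": {"DNA": {"chromosome": "13", "start": 900},
                                    "protein": {"tissue": "breast"}}},
]

# Rule set issued by a hypothetical application: units on chromosome 17,
# ordered by start position.
on_chr17 = lambda h: h.get("DNA", {}).get("chromosome") == "17"
result = query_units(units, [on_chr17], sort_key=lambda h: h["DNA"]["start"])
print([u["payload"] for u in result])
```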

Attention is now directed to FIG. 5, which depicts a biological data network 500 comprised of representations of biological data linked and interrelated by an overlay network 504 containing a plurality of network nodes 510. In one embodiment the network nodes 510 are in communication via network elements 520 (e.g., routers and switches) of the Internet 530 and thus overlay such Internet elements. Certain of the network nodes 510′ may have localized access, via a local area network or the like, to databases 550 containing the representations of biological sequence data, clinical data, or other information which are networked in the manner described herein. In one embodiment the network nodes 510′ may be configured to locally process information within a database 550 and make available all or part of the results of such processing, and potentially information within the database 550 itself, to others of the network nodes 510. In addition, the network nodes 510′ may also be designed to perform network processing functions along with the network nodes 510 in the manner described hereinafter.

The biological data network 500 may in one aspect be viewed as comprising a network of data stored within the databases 550 as well as within storage (not shown) at the network nodes 510. In one embodiment each biological data sequence or other sequence information stored within the network 500 may be accorded a unique identifier such as, for example, an IP address, in order to facilitate the establishment of such a data network. Moreover, tables may be maintained at each network node 510 for data tracking purposes (references herein to network node 510 are generally also intended to refer to network nodes 510′, unless the context of the reference clearly suggests otherwise). In particular, such tables may be used to track the sequence information available directly or indirectly (via other network nodes 510) from other network nodes 510, as well as the results of processing such sequence information at various nodes 510. These tables may be updated as biological data units containing sequence information and/or BioIntelligence and/or MetaIntelligence™ headers are transported between nodes for processing. Alternatively or in addition, overhead messages may be exchanged between network nodes 510 for the purpose of propagating the information stored within ones of these tables to the tables maintained by other nodes 510. Such messaging and updating of tables between network nodes 510 generates a type of BioIntelligent data awareness that provides a distinct advantage for processing and sharing data on network 500. Furthermore, the network processing that is carried out allows seamless access to network-associated processing functions, shared data as well as support databases that also contain properties of and information about the data.
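One way to picture the tracking tables and overhead messages described above is a per-node table that is merged with advertisements received from neighboring nodes, in the style of a distance-vector exchange. The disclosure does not specify the table layout or the update rule, so the class `SequenceLocationTable`, its fields, and the fewest-hops preference are assumptions of this sketch.

```python
class SequenceLocationTable:
    """Tracks which sequences are reachable from a node, directly or indirectly."""

    def __init__(self, node_id):
        self.node_id = node_id
        # sequence id -> (node holding the data, next hop toward it, hop count)
        self.entries = {}

    def add_local(self, sequence_id):
        """Record a sequence stored at, or locally accessible to, this node."""
        self.entries[sequence_id] = (self.node_id, self.node_id, 0)

    def advertisement(self):
        """Build the overhead message sent to neighboring nodes."""
        return {sid: (holder, hops) for sid, (holder, _next, hops) in self.entries.items()}

    def merge(self, neighbor_id, advert):
        """Fold a neighbor's advertisement into this table; True if anything changed."""
        changed = False
        for sid, (holder, hops) in advert.items():
            current = self.entries.get(sid)
            # Assumed rule: learn unknown sequences and prefer shorter paths.
            if current is None or hops + 1 < current[2]:
                self.entries[sid] = (holder, neighbor_id, hops + 1)
                changed = True
        return changed


a, b, c = (SequenceLocationTable(n) for n in ("a", "b", "c"))
a.add_local("seq-1")
b.merge("a", a.advertisement())   # b learns seq-1 directly from a
c.merge("b", b.advertisement())   # c learns seq-1 indirectly, via b
print(c.entries["seq-1"])
```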

Structure and Operation of Network Nodes of Biological Data Network

During operation of the network 500, requests from a client terminal 560 are received by a network node 510. Such requests are interpreted at the network node 510 and appropriate processing is carried out at such network node 510, and potentially other network nodes 510, in order to produce the requested results. In this regard BioIntelligence headers relating to all of the data throughout the network 500 that is designated as or otherwise made network accessible may be accessed and processed in response to requests from a client terminal 560. In this way intelligent information concerning data stored remote from a client terminal 560 and its associated network node 510, and/or such data itself, may be processed in a manner transparent to such terminal 560 and node 510.

Although certain of the embodiments disclosed herein contemplate that various ones of the network nodes 510 may perform specialized processing functions and operate cooperatively to produce an overall processing result, in other embodiments certain nodes may be capable of performing all of the processing functions necessary to deliver results in response to queries.

In certain aspects of the invention in which cooperative operations and processing functions are coordinated at various distributed network nodes 510, queries can be made that would facilitate the simulation, study and comprehension of systems in biology. In this case, BioIntelligence header information fields at the DNA, RNA and protein layers, along with query-dependent processing function requirements, serve as the activated substrates for generating a result.

In general, when a query/request is made, a suite of protocols is invoked based upon the properties of the request. For example, a request can be made from any client on the network 500, and the stack of application protocols uses processing functions at multiple nodes to access the associated data and a process management function to tabulate, coordinate and combine the partial information from multiple nodes to return the query result. In this regard, processing at a network node 510 can be achieved using either of at least two approaches. In a first approach of cooperative processing functions, data and/or partial processing results can be moved to the desired functional node 510 to be processed. Alternatively, the required processing function can be moved from a network node 510 to the location of the network accessible data at 550 and the data is processed at the site at which it resides on the network 504. Furthermore, a combination of the two approaches can be used to return the query result to end nodes or terminals 560. In addition, any result from processing that is new network information can be used to update tables at nodes 510 to enhance network awareness.
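The choice between the two approaches, moving data to the processing function or moving the function to the data, can be sketched as a simple planning step followed by a combination of partial results. The disclosure does not state how the choice is made; comparing the number of bytes that would cross the network is purely an assumption of this example, as are the names `plan_processing` and `combine_partials`.

```python
def plan_processing(data_bytes, function_bytes, data_node, function_node):
    """Decide where to process, using an assumed least-bytes-moved criterion."""
    if function_bytes < data_bytes:
        # Second approach: ship the processing function to where the data resides.
        return {"run_at": data_node, "move": "function", "bytes_moved": function_bytes}
    # First approach: ship the data (or partial results) to the functional node.
    return {"run_at": function_node, "move": "data", "bytes_moved": data_bytes}


def combine_partials(partials):
    """Process-management step: merge per-node partial results into one result."""
    combined = {}
    for node_id, partial in partials.items():
        for key, value in partial.items():
            combined.setdefault(key, []).append((node_id, value))
    return combined


# A whole-genome data set is far larger than a small analysis routine,
# so under the assumed criterion the function travels to the data.
plan = plan_processing(data_bytes=3_000_000_000, function_bytes=50_000,
                       data_node="node-with-db", function_node="variant-node")
print(plan["run_at"], plan["move"])

result = combine_partials({
    "node-1": {"variants": 12},
    "node-2": {"variants": 7, "drug_hits": 3},
})
print(result["variants"])
```

A combination of the two approaches corresponds to calling the planning step once per data set and combining whatever partial results come back.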

The network nodes 510 are aware of the types, the content and location of all network accessible data and its intelligence. Moreover, the network nodes 510 are aware of the types, locations and capabilities of processing functions on the network 504. In this regard each node 510 is regularly updated with the activities being performed by, and processing results generated by, each other node 510 of the network 500. In one embodiment, network-based applications and protocols are aware of the information contained in the different fields of the BI headers associated with the biological data units stored within the databases 550 and access such information to the extent necessary to process queries from terminals 560.

Turning now to FIG. 6, there is illustrated an exemplary protocol stack 610 implemented at a network node 510 together with corresponding layers of the OSI network model 600. As shown, the protocol stack 610 includes a DNA Network Protocol Stack (DPS™) over TCP/IP layers. The DPS™ supports a BioIntelligence-Aware Network Application capable of processing requests from a client terminal 560 and delivering results. As is discussed below, a network node 510 configured with the protocol stack 610 is capable of performing processing, switching and routing functions based upon not only the information within messages associated with the TCP/IP layers of the protocol stack 610 but also in accordance with the higher-layer information within BioIntelligence headers and other information associated with the DPS™. As a consequence, a network node 510 may use this higher-layer information to prioritize the processing of packets received by the network node 510. For example, the network node 510 may control quality of service (“QoS”) and effect load balancing based upon this higher-layer information.

The DPS™ is intended to enable existing Internet infrastructure to efficiently process and transport DNA sequence-based data. The DPS™ protocol stack comprises a DNA Transport Protocol™ (DTP™), DNA Signaling Protocol™ (DSP™), and DNA Control Protocol™ (DCP™). In one embodiment the DTP™ protocols enable network elements such as routers and switches to process, transport, and communicate biological data such as DNA sequence data and related information between single or multiple sources of streaming DNA servers (discussed below). The servers will include or have access to data containers (e.g., storage devices) including biological data units and/or unprocessed or partially processed DNA sequence data.

The functions of the DPS™ protocol suite comprise processing, transporting, controlling, switching and routing biological data such as DNA sequence information as streaming data so as to enable such data to be utilized for a variety of “streaming” applications. In this regard the DPS™ protocol stack will be used for pulling streaming biological data from servers having access to containers of biological sequence data. Such streaming applications are capable of continuously “pushing” and “pulling” biological sequence data as necessary to support the functionality of each particular application.

Various options exist for introducing the DPS™ protocol suite into existing network infrastructure. In one implementation, for example, the DPS™ protocol suite may be distributed throughout the routers/switches of a given service provider. In another implementation, the DPS™ protocol suite may reside only in one or more network elements near an edge of the service provider's network in an overlay network.

FIG. 7 shows a high-level view of the various data types that may be processed by a group of network nodes 510 in response to a query/request received from a client terminal 560. As shown, transcriptomics data, proteomics data and/or gene expression data stored as biological data units within databases or data containers accessible to the nodes 510 may be processed.

Attention is now directed to FIG. 8, which provides a block diagrammatic representation of the architecture of an exemplary network node 510. As shown, the network node receives incoming IP packets containing BioIntelligent biologically-relevant headers. Encapsulated within such incoming IP packets will typically be, for example, information identifying the particular gene(s) with which such biologically-relevant headers are associated. Such information could include, for example, the particular chromosome and position within the chromosome with which the gene is associated, protein information associated with the gene, whether the gene corresponds to a normal or minor allele, or other information pertinent to the gene. In addition, each incoming packet could also include information uniquely identifying the specific DNA sequence or other biological sequence information and the network location at which such sequence is stored. For example, such identifying information (which could be in the form of, for example, an IP address separate from the IP address of the incoming IP packet) could identify a particular network-accessible database and a location or position within such database. In other embodiments both information identifying the gene associated with the biologically-relevant headers within the incoming IP packet and information specifying a particular location at which the sequence information associated with such headers is stored could be inherent within a unique identifier included within the incoming IP packet.

Each incoming IP packet containing biologically-relevant headers is received via a network interface 810 and provided to an input packet processor 820. In one embodiment the network interface 810 is comprised of a physical port in communication with an external network and further includes, for example, buffers, controllers and timers configured to facilitate transmission and reception of packetized sequence data and other information over such network. The input packet processor 820 removes the IP header information and parses the higher-layer content included within the packet. A classification module 830 may then assign the packet to a particular class based upon this higher-layer content. The biologically-relevant header information included within the packet may then be passed to a configurable processing module 850 for processing in the manner described hereinafter based upon the determined class and any policies applicable to such class defined by policy module 840. As is also described hereinafter, the biologically-relevant header information may then be processed by the configurable processing module 850 with reference to various sequence location tables 870 and layered data tables 860 maintained at the network node 510. The layered data tables 860 are structured consistently with the biological data model (FIG. 2) used to define the biologically-relevant headers within each incoming IP packet.

Based upon the results of the processing performed by the configurable processing module 850, outgoing biologically-relevant header information associated with the biological sequence identified within the input IP packet or other processing results is provided to a transmit controller module 880 for packetization within an outgoing IP packet. To the extent the outgoing biologically-relevant header information requires further processing by another network node 510 in order to render an appropriate response to the user request received by the network 500, a load balancing module 882 within the transmit controller module 880 selects such a network node 510 from among the group of such nodes capable of performing the required processing. Such selection may be based upon, for example, the processing loads associated with each node within the group. Additionally, selection may be based upon processing results that are passed to the transmit controller module 880. A QoS module 884 places each outgoing IP packet in one or more queues in accordance with, for example, the applicable class accorded the corresponding incoming IP packet by the classification module 830 and the policy associated with such class. Each outgoing IP packet will generally include identifying information similar to that included within each incoming IP packet. The outgoing IP packets are provided by the transmit controller module 880 from the applicable queue to the network interface 810 for transmission to a destination network node 510.
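The interplay of classification, policy, load balancing and QoS queueing described for the node of FIG. 8 can be illustrated with the minimal sketch below. It is not the disclosed implementation: the class names, the policy table, the classification rule and the least-loaded selection criterion are all assumptions chosen to keep the example small.

```python
import heapq
import itertools

# Hypothetical policy table: packet class -> queue priority (lower is sent first).
POLICY = {"clinical": 0, "research": 1, "bulk": 2}


def classify(bi_headers):
    """Assign a class from higher-layer header content (illustrative rule only)."""
    if "clinical" in bi_headers:
        return "clinical"
    if bi_headers.get("DNA", {}).get("gene"):
        return "research"
    return "bulk"


def select_node(loads):
    """Load balancing: pick the least-loaded node among the capable candidates."""
    return min(loads, key=loads.get)


class TransmitController:
    """Queues outgoing packets by class priority and assigns a destination node."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a class

    def enqueue(self, packet, loads):
        cls = classify(packet["headers"])
        dest = select_node(loads)
        heapq.heappush(self._heap, (POLICY[cls], next(self._seq), dest, packet))
        return cls, dest

    def dequeue(self):
        _priority, _seq, dest, packet = heapq.heappop(self._heap)
        return dest, packet


tc = TransmitController()
loads = {"node-a": 0.8, "node-b": 0.3}
tc.enqueue({"id": 1, "headers": {}}, loads)                              # bulk
tc.enqueue({"id": 2, "headers": {"clinical": {"dx": "example"}}}, loads)  # clinical
dest, first = tc.dequeue()
print(dest, first["id"])
```

A single priority heap stands in here for the "one or more queues" of the QoS module; separate per-class queues with a scheduler would serve equally well.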

In one embodiment the BioIntelligence headers within each IP packet received by a network node 510 will be functionally associated with or contain information having biological relevance to a segment of DNA sequence data, MetaIntelligence™ metadata, or both. It should be appreciated that the BioIntelligence headers may be arranged in any order, whether dependent upon or independent of any associated payload data. However, in one embodiment the BioIntelligence headers are each respectively associated with a particular layer of a biological data cube model representative of the biological sequence data contained within the payloads of the biological data units with which such headers are associated. Moreover, it should be understood that any patient-related data which is not predicated upon genomic sequence information but is nonetheless pertinent to the processing by the network 500 of a request may be included within the BioIntelligence headers of a received IP packet.

It should be further understood that BI headers may be realized in essentially any form capable of embedding information within, or associating such information with, all or part of any biological or other polymeric sequence or plurality thereof. BI headers may also be placed within a representation of associated DNA sequence data, or could be otherwise associated with any electronic file or other electronic structure representative of molecular information. In particular, biological data units containing segmented DNA sequence data may be sorted, filtered and operated upon based on the associated information contained within the BioIntelligence header fields.

Attention is now directed to FIG. 9A, which illustratively represents a process effected by a network node 510 to implement a sequence variants processing procedure. In many instances the first process performed within the network 500 in response to receipt of a user query is the execution of a variants calling function at a network processing node 510. The variants calling function may be executed at the network node 510 receiving the user query. Alternatively, the procedure may be executed at a network node 510 specially configured for performing a comparative analysis of the subject patient whole or partial genome sequence against the selected reference/control sequence.

In an initial step of the variants processing procedure, a determination is made as to whether any differences exist between the biological data sequence associated with the query and the reference sequence. To the extent differences are detected, the nature of the differences and their locations with respect to the reference sequence are recorded. In this regard the sequence data associated with the query could comprise a portion of a gene or plurality of genes, an entire genomic sequence from normal cells, and/or an entire genomic sequence from diseased cells. The sequence data for a particular patient could comprise any, or a combination, of these types of sequence data.
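The initial comparison step, recording the nature and location of each difference relative to the reference, can be illustrated for the simplest case of single-base substitutions between two pre-aligned segments of equal length. This is a deliberately reduced sketch: it does not handle insertions or deletions, performs no alignment or statistical correction, and the function names and 0-based coordinate convention are assumptions.

```python
def call_substitutions(reference, sample, offset=0):
    """List (position, reference base, sample base) for every mismatch.

    Both inputs must already be aligned and of equal length. Positions are
    0-based and shifted by offset, the segment's start within the reference.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch expects pre-aligned sequences of equal length")
    return [
        (offset + i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def apply_substitutions(reference, variants, offset=0):
    """Rebuild the sample segment from the reference plus the recorded differences."""
    bases = list(reference)
    for position, _ref_base, alt_base in variants:
        bases[position - offset] = alt_base
    return "".join(bases)


reference = "ACGTAC"
sample = "ACCTAA"
variants = call_substitutions(reference, sample)
print(variants)
print(apply_substitutions(reference, variants) == sample)
```

Because the sample can be rebuilt from the reference and the recorded differences, the list of differences is also one possible representation of sequence data relative to a reference.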

In other embodiments a clinically transformed version of a patient's genomic sequence data, rather than the sequence data itself, is associated with user requests received by the network 500. Such a clinical transformation may involve, for example, associating a patient's medical records or health related information with any or a combination of the patient's genomic sequence or the patient's transcriptomic, proteomic, metabolomic or lipidomic information, or any other such related data. For example, such transformation could involve using certain minor allele variations in or near certain genes that are associated with certain phenotypes, symptoms, syndromes, diseases, disorders, etc. Furthermore, certain knowledge of the linkage disequilibrium that is associated with the haplotype map genome sequence of the patient might provide a detailed transformation of this genotyping data into information on protein concentrations in blood, urine and other body fluids. Information on functional activity of these proteins and their metabolic state which might include posttranslational modifications could be a useful part of improving the granularity of the patient's genomic-based transformed data. Accordingly, the present disclosure advantageously provides a mechanism for networking and sharing genomic-based data without requiring a corresponding sharing of a patient's genomic sequence data.

Again considering the process of FIG. 9A, in a comparison operation 910 packets of genomic sequence segments 914 are mapped to corresponding portions of a reference sequence 918. In an operation 922, statistical corrections are then carried out at the network node 510 on the basis of the comparison in order to make a variant call. Variant calls can be checked against a database of variant alleles since each node has awareness of the location of such data on the network. For example, a rare variant in a certain gene associated with breast cancer might be contained in the TCGA database with pertinent information on drug response. This information will include clinical responses to certain drugs that relate directly to the minor allele. The network can access the TCGA database and extract the required information for processing on the network or locally at the client server. The MetaIntelligence information about the data within the TCGA database can then be used to allocate processing functions.

For simplicity, in the case where SNPs are the only variants, dbSNP can be used to validate common SNPs. In addition, data on minor alleles with disease association might be present in other cancer genome databases that are maintained by public and private entities such as, but not limited to, CGP (Cancer Genome Project at Sanger Institute), TCGA (at NIH's National Cancer Institute), RCGDB (Roche Cancer Genome Database), and the like.

Attention is now directed to FIG. 9B, which is a flowchart of an exemplary variants processing procedure 930 representative of one manner in which a network node 510 configured for variants processing may be utilized in connection with processing a particular user request. In particular, consider the case in which a structured representation of the DNA sequence data of a breast cancer patient is received at a network node 510 configured for variants processing along with a reference sequence (stage 934). The structured sequence data is then mapped against the reference in order to produce the specific variant alleles forming the basis of variants calls made by the node 510 (stage 940). In this example it is assumed that the request accompanying the sequence data comprised a request to determine the pharmaceutical drug with the highest efficacy and with lowest toxic effects in view of the DNA sequence data of the patient. Once the specific variant alleles of the patient have been determined, the network node 510 configured for variants processing may issue a query/request that is processed by those network nodes 510 having access to public and private databases containing information relating to pharmacogenomics-based responses to various drugs (stage 944). The results of such queries may then be returned to the requesting client terminal 560 (stage 950), and the drug response data for specific variant alleles included within such results may then be used for analysis of the patient data (stage 954).

In the general case, once the processing to be performed at a given network node 510 has been completed, a decision will be made to route or switch the processing to another network node 510 based upon the results of such processing (stage 960). The extent of the processing to be performed by the network 500 with respect to a particular request will of course be dependent upon the nature of the request.

Turning now to FIG. 10, an illustrative representation is provided of the processing occurring at a network node 510 configured to perform a specialized processing function. As may be appreciated with reference to FIG. 10, a specialized processing function which is required to be performed is first carried out and the result of such a processing function is supported by access to public and private databases with relevant associated data.

In one embodiment each network node 510 implements a method which generally involves performing a processing operation involving ones of a first set of biological data units and a second set of biological data units. The processing might further involve a comparison of the called variant with access to established variants databases.

In the general case, the biological data unit encapsulated within the IP packet received by a network node 510 will contain a first header associated with first information relating to segmented biological sequence data and a second header associated with second information relating to the segmented biological sequence data. The method includes processing of the first information and the second information in relation to the content of the payload of the biological data unit. In one embodiment processing is carried out at each network node 510 with respect to biological data units including a first header associated with information relating to a first-layer representation of biological sequence data and a second header associated with information relating to a second-layer representation of biological sequence data wherein a biological, clinical, pharmacological, medical or other such relationship exists between the first-layer and second-layer representations. For example, the DNA sequence for a gene may be related to the cDNA or RNA sequence of that gene or the protein sequence, structure or function of the gene product. In one embodiment all of the data contained within a layered representation of the DNA sequence information (see FIG. 2) would be available for a subset of patients at each client server.

As may be appreciated with reference to FIG. 2, a biological data unit predicated upon the layered data model of FIG. 2 includes a transformed representation of a biological sequence and a first header associated with first information relating to such sequence. Since the headers included within such a biological data unit may generally correspond to the layers of the layered data structure of FIG. 2, it should be understood that a processing node 510 that operates on a given layer of data will typically be able to access only a certain type of data. For example, in one embodiment “layer 1” headers are associated with the DNA layer and a network node 510 configured for “layer 1” processing would access DNA-related data.
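By way of illustration only, the correspondence between layer headers and layer-specific processing may be sketched as follows. The class names, the function `headers_visible_to` and the field values are hypothetical and are not drawn from the disclosure; the sketch merely assumes that each header carries a layer number and a set of named fields (H1, H2, and so on).

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Layer(IntEnum):
    """Layers of the layered data model (numbering follows the text)."""
    DNA = 1
    RNA = 2
    PROTEIN = 3


@dataclass
class LayerHeader:
    layer: Layer
    # Named header fields, e.g. {"H1": "...", "H2": "..."} (illustrative only)
    fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class BiologicalDataUnit:
    headers: List[LayerHeader]
    payload: str = ""  # segmented sequence data; may be empty


def headers_visible_to(unit: BiologicalDataUnit, node_layer: Layer) -> List[LayerHeader]:
    """Return only the headers a node configured for node_layer would read."""
    return [h for h in unit.headers if h.layer == node_layer]


unit = BiologicalDataUnit(
    headers=[
        LayerHeader(Layer.DNA, {"H1": "base positions", "H2": "gene ID"}),
        LayerHeader(Layer.RNA, {"H1": "transcription start site"}),
    ],
    payload="ACGT",
)
dna_headers = headers_visible_to(unit, Layer.DNA)
```

In this sketch a node configured for “layer 1” processing sees only the DNA-layer header, even though the same data unit also carries an RNA-layer header.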

Attention is now directed to FIG. 11, which provides a representation of an exemplary processing platform 1100 capable of being configured to implement a network node 510. The processing platform 1100 includes one or more processors 1110, along with a memory space 1170, which may include one or more physical memory devices, and may include peripherals such as a display 1120, user input/output devices such as mice, keyboards, etc. (not shown), one or more media drives 1130, as well as other devices used in conjunction with computer systems (not shown for purposes of clarity).

The platform 1100 may further include a CAM memory device 1150, which is configured for very high speed data location by accessing content in the memory rather than addresses as is done in traditional memories. In addition, one or more databases 1160 may be included to store data such as compressed or uncompressed biological sequences, dictionary information, metadata or other data or information, such as computer files. Database 1160 may be implemented in whole or in part in CAM memory 1150 or may be in one or more separate physical memory devices.

The platform 1100 may also include one or more network connections 1140 configured to send or receive biological data, sequences, instruction sets, or other data or information from other databases or computer systems. The network connection 1140 may allow users to receive uncompressed or compressed biological sequences from others as well as send uncompressed or compressed sequences. Network connection 1140 may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies as are known or developed in the art.

Memory space 1170 may be configured to store data as well as instructions for execution on processor(s) 1110 to implement the methods described herein. In particular, memory space 1170 may include a network processing module 1172 for performing network-based processing functions as described herein. Memory space 1170 may further include an operating system (OS) module 1174, a data module 1176 configured to temporarily store sequence data and/or associated attributes or metadata, and a module 1178 for storing results of the processing effected by the network processing module 1172.

The various modules included within memory space 1170 may be combined or integrated, in whole or in part, in various implementations. In some implementations, the functionality shown in FIG. 11 may be incorporated, in whole or in part, in one or more special purpose processor chips or other integrated circuit devices.

Attention is now directed to FIG. 12, which illustrates one manner in which data may be processed, managed and stored at an individual network node 510 in an exemplary clinical environment. In particular, FIG. 12 depicts one way in which the information technology systems of a medical provider (e.g., an oncologist) could interface with network processing at a node 1210 included within a local area network in communication with the data network 500. In one embodiment the network processing node 1210 may have similar or identical processing functionality as the nodes 510 of the network 500 and would be in communication with at least one such node 510, but could also be locally networked with other information technology infrastructure in a campus environment not part of the network 500.

In one embodiment none of the data which is stored in the local storage container 1220 is generally accessible to clients 560 of the network 500. Movement of data between storage containers associated with or accessible to different network nodes 510 may be governed by the policies established by the one or more clients 560 controlling such containers. For example, depending on the policy in place at a first network node 510, certain aspects of actual patient data or a transformed version of such data might be “pulled” in whole or in part from data containers accessible to a second network node 510.

BioIntelligence Access to Existing Knowledge

Attention is now directed to FIGS. 13-18, which illustratively represent the manner in which information within the layered data structure 200 is utilized at an individual network processing node 510. In particular, each of FIGS. 13-18 depicts an exemplary representation of the relationship between information in the BioIntelligence headers 1304 of a biological data unit associated with a query message and prior knowledge 1308 within storage accessible to the node 510 that is used in generating a response to the message. It should be understood that FIGS. 13-18 provide only one example of a set of three layers of BI header information or metadata attributes which are directly associated with the various layers of the knowledge structure.

As may be appreciated by reference to FIGS. 13-18, the first field of information present within each BI layer header specifically relates to a first source of data and/or knowledge associated with such BI header. For example, the fields within the “layer 1” header 1310 will relate directly to a first layer of the structured knowledge data model. In this case the fields within the layer 1, or “L1”, header 1310 can relate to L1 data (i.e., DNA-related data in the case of the data model 200). Similarly, information contained in the fields of the layer 2, or “L2”, header relates directly, but not strictly, to the data and knowledge presented in the second layer (i.e., the RNA layer).

Referring now specifically to FIG. 13, “H1” represents a first field of the BioIntelligence information within the L1 set of attributes that represent header 1310 of a given data packet. In the example of FIG. 13 the particular attributes within the L1 header 1310 directly correspond to characteristics of the first layer (i.e., the DNA layer 210) of the layered model of existing related knowledge 200.

It should be noted that FIG. 13 depicts only the different layers of headers and the various header information fields, and not any associated payload of segmented sequence data, of a particular biological data unit. As discussed above, IP packets based upon a particular biological data unit which is exchanged between network nodes 510 may or may not include such payload data (i.e., such IP packets may only include higher level abstracted attribute information corresponding to the biological data unit).

In the embodiment of FIG. 13, the header field H1 within the L1 header 1310 relates to a particular type of information pertinent to the DNA layer 210. For example, as indicated by DNA-layer table 1320 maintained by the individual network processing node 510, the field H1 within the L1 header 1310 may point to the base positions for a sequence of genomic data within the payload of the biological data unit containing headers 1304. The layered prior knowledge that is being accessed or related or pointed to by BioIntelligence attributes such as H1 is specifically associated with DNA layer information of data 1308.

The segmented sequence data within the payload of the biological data unit identified by the field H1 within the L1 header 1310 may represent a certain region of a genome that may be positioned in similar but not necessarily identical base positions. For example, the region or section of the genome represented in the payload for a particular gene would, upon comparison, be expected to code for the same gene or at least for different isoforms of the same gene.

As a result, applying the L1H1 header field (layer 1, header field 1) to the stored DNA data would give comparable results for the various DNA layer annotations that are present in that data container. Such DNA layer information could include, for example, gene ID, chromosome, base positions, regulatory regions, 5′ and 3′ UTR, variant alleles and other DNA-based information related to the gene. Based on the query message, the individual network processing node 510 accesses information within data cubical of prior knowledge 1308 relating to, for example, chromosome number (for simplicity, not shown) and base positions identified by the L1H1 header field.
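A minimal sketch of such an L1H1-driven lookup is given below for illustration. It assumes, hypothetically, that the DNA-layer prior knowledge is held as a list of annotation rows keyed by chromosome and inclusive base positions; the names `DnaAnnotation`, `lookup_l1h1` and `DNA_TABLE`, and all gene identifiers and coordinates, are placeholders rather than actual data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DnaAnnotation:
    gene_id: str
    chromosome: str
    start: int  # first base position, inclusive
    end: int    # last base position, inclusive


def lookup_l1h1(table: List[DnaAnnotation], chromosome: str,
                start: int, end: int) -> List[DnaAnnotation]:
    """Return DNA-layer rows whose base positions overlap the queried range."""
    return [
        row for row in table
        if row.chromosome == chromosome and row.start <= end and start <= row.end
    ]


# Placeholder DNA-layer table; identifiers and positions are invented.
DNA_TABLE = [
    DnaAnnotation("GENE_A", "17", 1_000, 5_000),
    DnaAnnotation("GENE_B", "17", 9_000, 12_000),
    DnaAnnotation("GENE_C", "13", 2_000, 4_000),
]

hits = lookup_l1h1(DNA_TABLE, "17", 4_500, 9_500)
```

Because the match is made on overlapping rather than identical base positions, a payload segment positioned in similar but not identical positions still retrieves the same DNA-layer annotations.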

Referring now to FIG. 14, “H2” represents a second attribute of BioIntelligence header information within the L1 header 1310 of the certain data packet (i.e., the “L1H2” header field). In this case, the L1H2 header field refers to a second field in the DNA layer that points specifically to the associated gene or gene product related to the packetized segment of DNA sequence data within the biological data unit associated with headers 1304. Such sequence data could, for example, code for one gene, a plurality of genes or a part of a gene (represented in either the + or − orientation based on the 5′ to 3′ direction of the sense strand). As indicated by FIG. 14, the L1H2 attribute field relates or points to the gene ID section of the distributed network-accessible data 1308.

In one embodiment this field should contain at least one representation for the name of the gene and/or gene product that is encoded by the DNA sequence in the payload of the biological data unit associated with headers 1304. In cases where more than one name is used to identify a gene, gene product or the activity associated with that gene, the most current and widely accepted names are listed. Any gene ID name that is used to relate specifically to the sequence represented by the chromosome number and base positions that are indicated in the first header field of layer 1 should be encoded by this particular sequence in this region of the genome. However, because of gene duplication, copy number variations, existence of gene families, repeat sequences, mobile transposable elements and other such related molecular phenomena, certain classes of redundancy will exist. Furthermore, one gene or the polypeptide product of a gene or the enzymatic activity of a gene could be associated with more than one disease, syndrome, disorder, phenotype, etc.

Turning now to FIG. 15, “H3” represents a third field of header information within the L1 header 1310 of the certain data packet (i.e., the “L1H3” header field). In this case, the L1H3 header field relates to any phenotypic expression of the encoded gene that is associated with a disease or disorder. That is, in the example of FIG. 15 the “L1H3” header field points to disease(s) known or predicted to be associated with the gene, a mutated or variant form of the gene, or an expressed gene product.

For simplicity and clarity, the supportive data in this case show three different cancer types that are associated with packaged genome sequence data attached to the exemplary header fields. The diseases that are known to have association with the segmented sequence in the payload of this biological data unit in this case are colon, cervical and breast cancers. The gene or sequence segment might represent an up-regulated oncogene or proto-oncogene, a down-regulated tumor suppressor gene or a structural or functional gene involved in a pathway with other genes associated with the disease.

Referring now to FIG. 16, a first field of information within the L2 header 1610 of the certain data packet is denoted by “H1”. In the example of FIG. 16 the header fields within the L2 header 1610 directly correspond to characteristics of the second layer (i.e., the RNA layer 220) of the layered data model 200. It should be appreciated that network access to the data that relates to the diseases associated with any packetized segment of DNA sequence data will be through a layer 1 (DNA layer) access. Access to data associated with other layers, e.g., layer 2 and layer 3, will require access to information associated with the header fields of layer 2 or layer 3. That is, the header fields associated with the L1 header 1310 will generally relate only to data in the DNA layer 210 of the layered data structure 200, the header fields within the L2 header 1610 will relate only to data within the RNA layer 220, and so on. Such RNA-layer data related to a gene of interest could include, for example, the lengths of the pre-mRNA and mature mRNA, exon selection, alternate splicing, data on differential expression of RNA, transcription control and any RNA-related information.

As shown in FIG. 16, fields within the L2 header 1610 relate to the RNA layer 220 of the layered data structure 200. For example, in the embodiment of FIG. 16 the H1 field may relate to the transcription start site of the mRNA for the gene identified by fields of the L1 header 1310. In other words, the transcription start site information included within the RNA layer 220 would relate to the chromosomal position of the gene. It should be understood that all of the information and field data in FIG. 16 is exemplary, and none of such information actually relates to any information concerning any particular gene. For instance, where BRCA1 might be used to indicate a gene and chromosome 17 the chromosome, all of the information in the related table 1620 is exemplary. Thus, information within the RNA layer 220 and the DNA layer 210 are associated and interrelated by layered data structure 200 in a manner that allows independent access to the different information and/or data types or layers.

Attention is now directed to FIG. 17, in which “H2” represents a second field of header information within the L2 header 1610 of the certain data packet (i.e., the “L2H2” header field). In this case, the L2H2 header field relates to RNA-layer information pertaining to the length of a transcript. The RNA data on this particular gene shows a variety of lengths for the transcript. Entries that harbor an insertion show relatively longer transcript length; conversely, the shorter length transcripts show deleted bases in comparison with the normal case.
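The relationship between transcript length and the normal case can be illustrated with the short sketch below. The function name `classify_transcript` and the lengths used are hypothetical; length alone is treated here only as being consistent with an insertion or deletion, not as proof of one.

```python
def classify_transcript(length: int, normal_length: int) -> str:
    """Compare a transcript length with the normal-case length."""
    if length > normal_length:
        return "longer (consistent with insertion)"
    if length < normal_length:
        return "shorter (consistent with deletion)"
    return "normal length"


# Invented transcript lengths for a single gene, compared against an
# invented normal-case length of 7000 bases.
NORMAL_LENGTH = 7000
labels = [classify_transcript(n, NORMAL_LENGTH) for n in (7000, 7012, 6950)]
```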

Referring now to FIG. 18, the third field (“H3”) of header information within the L2 header 1610 may relate to other information associated with the RNA layer 220. For example, this “L2H3” header field may relate to the exon selection of a gene associated with breast cancer.

In this example, the variations in the number of exons that are contained in this gene indicate the existence of different splice variants that are associated with the transcripts from cells taken from the breast tumor tissue. The defect in splicing could be from variants of the gene or some component of the splicing mechanism.

In the embodiment of FIG. 18, layer 3 (“L3”) headers 1810 may include information associated with a protein layer of the data model 200. Such protein-layer information may include, for example, the molecular weight of the protein product of the gene identified by the L1 header 1310, amino acid count and content, expression level, activity, posttranslational modifications, structure, function and other related information.

Although FIG. 18 does not explicitly depict the relationship between the fields of the L3 header 1810 and corresponding portions of the data cubical 1308, such fields are related to the protein-layer data within cubical 1308 in a manner consistent with that described above with respect to DNA-layer and RNA-layer information.

Aggregating Biological Data Units and Associated Information Scripts Using BioIntelligence

Attention is now directed to FIG. 19, which depicts a Smart Repository™ 1910 configured to retrieve and aggregate genomic-related and other data relevant to the interests of actors interacting with the Smart Repository™ 1910. In one embodiment the Smart Repository™ 1910 may collect and provide information relevant to, but different from, information explicitly requested within queries received at the Smart Repository™ 1910 from such actors. Such data may comprise, for example, clinical and/or research data pertinent to a received query that is tailored to interests of a requesting actor. In other embodiments the Smart Repository™ 1910 may infer the interests of such actors based upon the sequence-related information uploaded or downloaded by such actors to and from the Smart Repository™ 1910. Based upon such inferred interests of a given actor the Smart Repository™ 1910 may then, either in connection with such uploading/downloading activities of the actor or otherwise, retrieve and aggregate such genomic-related and/or other data and provide the aggregated information to the actor.

In one embodiment the Smart Repository™ 1910 comprises a node of, or is in network communication with, a biological data network 1914 containing a plurality of other nodes 1918 and/or is in communication with other data networks, such as the Internet. In such embodiment the Smart Repository™ 1910 may be, except with regard to the information aggregation functionality described below, functionally and architecturally similar or identical to the network nodes 510 described above. In other embodiments the Smart Repository™ 1910 is configured to perform only the information aggregation functions described hereinafter and is not otherwise configured for network-based processing of sequence-related information. In still other embodiments the Smart Repository™ 1910 is not included within a biological data network but has access to information within other networks, such as the Internet.

As shown in FIG. 19, the Smart Repository™ 1910 includes a SmartTracker™ module 1920 and a transcriptor 1930. The Smart Repository™ node 1910 also includes or has access to a genome data repository 1940 containing, for example, metadata files relating to sequence information stored within the repository 1940 or elsewhere in the biological data network in which the Smart Repository™ 1910 is included. Finally, the Smart Repository™ 1910 may have access to other clinical or research data stored elsewhere in the biological data network 1914 or within other data networks in communication with the biological data network 1914.

In one embodiment the transcriptor 1930 operates to substantially continuously monitor the biological data network 1914 and such other data networks for information of potential relevance to users of the Smart Repository™ 1910. Certain of such information may then be retrieved by the Smart Repository™ 1910 and cached within the genome data repository 1940. For example, the transcriptor 1930 may collect drug efficacy and other information relating to sequence data stored within the repository 1940 which contains various biomarkers and is associated with particular disease conditions. Such information may be relatively detailed and comprehensive. For example, information relating to drug efficacy may include a “confidence” score associated with the information; that is, an indication of the level of confidence associated with the efficacy information. A high confidence score could be assigned to drugs for which relatively large amounts of patient data are available to confirm the reported efficacy, while relatively lower confidence scores could be assigned in the absence of extensive corroborating patient data.

In a first mode of operation, the SmartTracker™ module 1920 receives a query 1950 from an actor 1956 in the form of, for example, a client computer similar or identical to the client terminal 560. In one embodiment the SmartTracker™ module 1920 is, among other things, configured to track the uploading and downloading of sequence-related information occurring between the actor 1956 and the Smart Repository™ 1910. In one embodiment the transcriptor 1930 assembles, based upon the subject matter of the query 1950, the sequence-related upload and/or download activity of the requesting actor and/or other aspects of the actor's profile, a script of information of various different types determined to be of relevance to the query in view of the interests of the requesting actor. This assembling may include, for example, parsing fields of the metadata files 1944 associated with files of sequence data 1942 determined to be relevant to a query 1950 in order to identify clinical and/or biological information related to the query. Once such clinical and/or biological information has been identified, its relevance may be quantified and ranked and a script 1960 containing such information is provided to the requesting actor 1956. In one embodiment the script 1960 may also be locally cached within the genome data repository 1940 and provided to other actors (not shown) which the SmartTracker™ module 1920 presently or subsequently determines possess interests similar to the requesting actor 1956. In other embodiments the script 1960 may be aggregated with other information scripts cached within the genome data repository 1940. These aggregated scripts may then be combined with other relevant information for inclusion within scripts of information subsequently generated by the transcriptor 1930 in response to subsequent requests for genomic-related information received by the Smart Repository™ 1910.
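One hypothetical way in which relevance could be quantified and ranked when assembling a script is sketched below. The scoring rule (one point per metadata value matching a query term, half a point per value matching an interest inferred from the actor's upload/download activity) is an assumption made purely for illustration, as are the names `MetadataEntry`, `relevance` and `assemble_script` and the sample values.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class MetadataEntry:
    file_id: str
    fields: Dict[str, str]  # parsed metadata fields, e.g. {"Disease": "..."}


def relevance(entry: MetadataEntry, query_terms: Set[str],
              actor_interests: Set[str]) -> float:
    """Assumed scoring: query matches count 1.0, inferred-interest matches 0.5."""
    values = {v.lower() for v in entry.fields.values()}
    return len(values & query_terms) + 0.5 * len(values & actor_interests)


def assemble_script(entries: List[MetadataEntry], query_terms: Set[str],
                    actor_interests: Set[str]) -> List[str]:
    """Return file ids with non-zero relevance, most relevant first."""
    scored = [(relevance(e, query_terms, actor_interests), e.file_id) for e in entries]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [file_id for _, file_id in ranked]


# Placeholder metadata; gene names are invented.
entries = [
    MetadataEntry("f1", {"Disease": "breast cancer", "Gene": "GENE_A"}),
    MetadataEntry("f2", {"Disease": "colon cancer", "Gene": "GENE_B"}),
    MetadataEntry("f3", {"Disease": "breast cancer", "Gene": "GENE_B"}),
]
script = assemble_script(entries, {"breast cancer"}, {"gene_b"})
```

In this sketch file f2 appears in the script even though it does not match the query, reflecting the notion that a script may include information relevant to, but different from, what was explicitly requested.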

In a second mode of operation, the SmartTracker™ module 1920 tracks the uploading and downloading activity of sequence-related information occurring between the actor 1956 and the Smart Repository™ 1910 and instructs the transcriptor 1930 to assemble a script of related information based upon one or more aspects of this activity. The resultant script of related information may then be provided by the transcriptor 1930 to the actor 1956 in connection with an uploading or downloading transaction initiated by the actor 1956. In one embodiment the contents of this script of related information are not necessarily pertinent to specific information included within a particular request from a requesting actor, but rather are selected by the transcriptor 1930 based upon the uploading/downloading activity and/or other system usage of the actor 1956.

In one embodiment the sequence data and BioIntelligence information assembled by the transcriptor 1930 in response to a request from an actor is provided to such actor through an entitlement control module 1970. For example, an authenticated actor can query the repository for a list of all the genome data files of individuals with colorectal cancer that were uploaded within the current year by the actor 1956 or other actors (not shown). Such a query would return a complete list of the files. However, in one embodiment the actor 1956 would be permitted to access only those files within the list which the entitlement control module 1970 determines the actor 1956 is authorized to access. For example, in certain embodiments only the owners of a subset of the listed genome data files may have consented to permit the actor 1956 to download or otherwise access such files. In embodiments in which the actor 1956 subscribes to services offered by the Smart Repository™ 1910 and/or biological data network 1914, the entitlement control module 1970 could be further configured to determine whether or not the requesting actor 1956 has a current subscription and, if so, the level or “quality of service” associated with the subscription. For example, in embodiments in which the actor 1956 has subscribed to a relatively higher quality of service, the information within the script provided by the transcriptor 1930 may be more recent or drawn from a wider variety of sources than would be the case had the actor 1956 opted for a lower quality of service.
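The distinction between the complete list returned by a query and the subset an actor may actually access can be sketched as follows. This is an illustrative model only: the classes `Actor` and `EntitlementControl`, the per-file consent map and the optional subscription check are assumptions and do not describe how the entitlement control module is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Actor:
    actor_id: str
    subscription_active: bool = False


@dataclass
class EntitlementControl:
    # file_id -> ids of actors to whom the file owner has granted access (assumed model)
    consents: Dict[str, Set[str]] = field(default_factory=dict)
    # Only some embodiments involve subscriptions, so the check is optional.
    require_subscription: bool = False

    def accessible(self, actor: Actor, listed_files: List[str]) -> List[str]:
        """Filter a complete query result down to the files the actor may access."""
        if self.require_subscription and not actor.subscription_active:
            return []
        return [f for f in listed_files
                if actor.actor_id in self.consents.get(f, set())]


listed = ["crc_001", "crc_002", "crc_003"]  # complete list returned by the query
control = EntitlementControl(consents={"crc_001": {"actor_1956"},
                                       "crc_003": {"actor_1956", "actor_x"}})
allowed = control.accessible(Actor("actor_1956"), listed)
```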

In one embodiment the transcriptor 1930 may track attribute information located in the fields of the metadata files 1944 or BioIntelligence headers associated with the genomic sequence data files 1942. The information contained in the metadata files 1944 may be of any pertinent type including, without limitation, health record, image, clinical, pharmacological, medical, environmental and social data.

In certain embodiments the genome sequence data files 1942 may comprise disease-normal matched pair genomic sequence data (i.e., genomic sequence data associated with diseased tissue and “matched” genomic sequence data associated with normal tissue from the same individual). In this embodiment the SmartTracker™ module 1920 may track metadata attributes containing higher-order information with research and clinical relevance and instruct the transcriptor 1930 to include this type of information within the metadata files 1944. For example, the metadata files 1944 may contain annotation and attribute information such as, without limitation, germline and somatic genome variants, CNV data, methylation sequence data, microbiome, metabolome, transcriptome, proteome and any other related structure, function and genetic data.

The functionality of the transcriptor 1930 may be determined at least in part by the type of information incorporated into the metadata files 1944. For example, in metadata files 1944 containing transcriptome-related information, the transcriptor 1930 may be configured to aggregate data such as microRNA-Seq, mRNA-Seq and any other transcriptomic information of or pertaining to alternative splicing, differential expression and regulation.

Attention is now directed to FIG. 20, which depicts a Smart Repository™ 2010 which includes a SmartTracker™ module 2018 and a transactor 2020. In the embodiment of FIG. 20, the transactor 2020 operates to assign actors 2024 disposed to interact with the Smart Repository™ 2010 to various “casts” of actors based upon the interaction between the various actors 2024 and the Smart Repository™ 2010. The Smart Repository™ 2010 is substantially similar to the Smart Repository™ 1910 (FIG. 19), and includes a genome data repository 2040 containing, for example, metadata files 2042 relating to sequence information files 2044 stored within the repository 2040 or elsewhere in the biological data network (not shown) in which the Smart Repository™ 2010 is included. The Smart Repository™ 2010 may also have access to other clinical or research data stored elsewhere in such biological data network or within other data networks.

As shown in FIG. 20, the SmartTracker™ module 2018 may receive a query 2150 from an actor 2024X comprised of, for example, a client computer. In one embodiment the SmartTracker™ module 2018 is, among other things, configured to track the uploading and downloading of sequence-related information occurring between the actor 2024X and the Smart Repository™ 2010.

In general, the transactor 2020 functions to monitor such interaction based upon information provided by the SmartTracker™ module 2018 and assign actors 2024 exhibiting similar behavior to the same cast of actors. For example, actors 2024 tending to download, from the Smart Repository™ 2010, similar segments of genomic sequence data or other information could be grouped by the transactor 2020 into the same Cast X 2030 comprised of actors 2024X.
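The grouping of actors into casts may be illustrated by the following sketch, which makes the simplifying assumption that “similar behavior” means requesting the same file. The function `assign_casts` and the actor and file identifiers are hypothetical; an actual transactor could apply other criteria.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def assign_casts(requests: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group actor ids into casts keyed by the file each actor requested.

    requests is an iterable of (actor_id, file_id) pairs, standing in for the
    tracked upload/download activity reported by a tracker.
    """
    casts: Dict[str, Set[str]] = defaultdict(set)
    for actor_id, file_id in requests:
        casts[file_id].add(actor_id)
    return dict(casts)


tracked = [
    ("actor_a", "file_X"),
    ("actor_b", "file_X"),
    ("actor_c", "file_Y"),
    ("actor_a", "file_Y"),
]
casts = assign_casts(tracked)
```

Here the cast for file_X contains actor_a and actor_b, and an actor may belong to more than one cast when its activity spans several files.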

As is described in the above-referenced provisional application Ser. No. 61/539,942, grouping of the actors 2024X into a common Cast X 2030 enables large files of sequence data 2044 and other information to be downloaded by such actors 2024X using a genomic sequence transfer protocol designed to more efficiently utilize the network bandwidth available to the Smart Repository™ 2010. The disclosed genomic sequence transfer protocol provides a method for secure, high-speed file transfer which is capable of overcoming the disadvantages of TCP and existing peer-to-peer protocols with respect to the distribution of files of very large size. Like other peer-to-peer file distribution systems, the high-speed file transfer system disclosed in the above-referenced provisional application utilizes a tracker (e.g., the SmartTracker™ module 2018) to enable a plurality of actors 2024X within the Cast X 2030 to cooperatively distribute a file of interest. Within the context of the genomic sequence transfer protocol, the transactor 2020 operates to identify and make a record of those actors 2024 which request a certain file of interest (e.g., “file X”). The transactor 2020 will also generally include or be paired with an entitlement control module 2040 configured to determine the authentication and entitlement of each actor based on authorization rules and using a secure key distribution scheme.

In one embodiment the transactor 2020 may determine which actors 2024 are assigned to a particular cast (e.g., Cast X 2030) based upon, for example, the file requested, the location of the file (i.e., with which actor(s) 2024 the file is currently stored), as well as the credentials of the actors 2024 requesting access to the file. Once an actor 2024 has been directed to a particular cast, the actor 2024 exchanges messages 2046 with other actors 2024 within the cast in order to determine and receive the portions of the file of interest currently possessed by the Cast X 2030. Stated differently, the transactor 2020 proactively directs a requesting leecher actor to a feeder affinity group such that the leecher receives as much of the requested file as possible without, to the extent possible, incrementing the burden on the seed of file X.

In the case of very large files, such as files containing genomic or other biological sequence information, the disclosed genomic sequence transfer approach effectively “parallelizes” the transfer of file information and reduces the burden on the initial seed or seeds of file X. Moreover, the use of parallel streams within the disclosed system minimizes the effect of a multiplicative decrease in the speed of any one stream resulting from the characteristics of TCP. Thus, use of the disclosed genomic sequence transfer protocol may reduce the likelihood of bottlenecks developing around overburdened seed servers in connection with the transfer of very large data files.

The use of such parallel streams also enables the separate encryption of each individual file segment, thus obviating the need for re-encryption and retransmission of the entire file in the event of corruption of an individual segment. Particularly in the case of very large data files containing sensitive information (e.g., files containing genomic sequence information), this aspect of the disclosed genomic sequence transfer protocol may offer considerable advantages relative to existing methods of file distribution.
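The benefit of handling each file segment separately can be illustrated with the sketch below. As a stand-in for per-segment encryption, which is not specified here, the sketch uses a per-segment SHA-256 digest from the Python standard library to detect a corrupted segment so that only that segment is retransmitted. The segment size, function names and data are assumptions made for illustration.

```python
import hashlib
from typing import List


def split_segments(data: bytes, size: int) -> List[bytes]:
    """Split a file into fixed-size segments (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(segments: List[bytes]) -> List[str]:
    """Compute one SHA-256 digest per segment."""
    return [hashlib.sha256(s).hexdigest() for s in segments]


def corrupted_indices(received: List[bytes], expected: List[str]) -> List[int]:
    """Return the indices of received segments whose digest does not match."""
    return [i for i, (seg, d) in enumerate(zip(received, expected))
            if hashlib.sha256(seg).hexdigest() != d]


original = b"ACGT" * 8                      # stand-in for a very large file
segments = split_segments(original, 8)      # four 8-byte segments
expected = digests(segments)

received = list(segments)
received[2] = b"ACGTACGA"                   # simulate corruption of one segment
bad = corrupted_indices(received, expected)

for i in bad:                               # retransmit only the bad segments
    received[i] = segments[i]
restored = b"".join(received)
```

Only the single affected segment is re-sent in this sketch; the remaining segments are left untouched, which is the property the per-segment approach is intended to provide.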

Turning now to FIG. 21, there is illustrated an implementation of a Smart Repository™ 2110 which includes a SmartTracker™ module 2118, a transactor 2120 and a transcriptor 2128. Except to the extent described otherwise below, the transactor 2120 functions in a substantially identical fashion as the transactor 2020 (FIG. 20) and the transcriptor 2128 functions substantially identically to the transcriptor 1930 (FIG. 19). For example, during operation the transactor 2120 assigns actors 2124 disposed to interact with the Smart Repository™ 2110 to various casts based upon information provided by the SmartTracker™ module 2118 relating to the interaction between the various actors 2124 and the Smart Repository™ 2110. The Smart Repository™ 2110 also includes a genome data repository 2140 containing, for example, metadata files 2142 relating to files of sequence information 2144 stored within the repository 2140 or elsewhere in the biological data network (not shown) in which the Smart Repository™ 2110 is included. The Smart Repository™ 2110 may also have access to other clinical or research data stored elsewhere in such biological data network or within other data networks.

In one embodiment, the transcriptor 2128 will use the metadata collected by the SmartTracker™ module 2118 and stored in the metadata files 2142 of the genome data repository 2140 to form an aggregated script of associated information. Based on classification schemes which may be continuously updated and curation of such information within the metadata files 2142 by, for example, subject matter experts, the transcriptor 2128 assembles one or more scripts 2160 of stratified, highly-relevant clinical and research data. This assembling may include, for example, parsing fields of the metadata files 2142 associated with sequence data files 2144 determined to be relevant to a query 2150 in order to identify clinical and/or biological information related to the query 2150. This assembling may further include, for example, evaluating the sequence-related upload/download activities of the requesting actor 2124X. Once such clinical and/or biological information has been identified and such upload/download activity evaluated, the relevance of the information may be quantified and ranked and a script 2160 containing such information is provided to the requesting actor 2124X.

The information within the metadata files 2142 will typically include one or more tags identifying the corresponding sequence data to which such information relates as well as relevant annotations and abstracted data. The information within the metadata files 2142 or BioIntelligence headers will typically include, without limitation, information from pathology reports associated with the corresponding sequence data such as, for example, gross and microscopic descriptions, diagnosis, tumor size and grade. The metadata files 2142 may also include information concerning a patient's blood report that is relevant to a given diagnosis, as well as treatment options relevant to such diagnosis. The information fields relating to such blood report will generally include, for example, red blood cell (RBC), leukocyte and platelet (thrombocyte) counts, hemoglobin concentration, hematocrit measures, erythrocyte size test and mean corpuscular measures. The metadata files 2142 may also include information concerning a patient's cytology report indicating, for example, the presence or absence of atypical cells and/or malignant dysplasia.

Metadata Attributes Associated with Genome Data

Set forth below is a representative list of the types of exemplary information fields which may be included within the metadata files 2142 or otherwise stored within the genome data repository 2140 as associated BioIntelligence header information.

Molecular Metadata Attributes

-   ID or UUID: Universal Unique ID corresponding to a sequence data file or to other information related to the sequence information of a sequence data file
-   Disease: The diseases which are associated with a sequence data file
-   Cell: Cell or tissue type used to prepare analyte
-   CNV: Relevant information on copy number variation
-   SV: Related structural variants and chromosomal rearrangements
-   SNP: SNPs associated with the diseases
-   microRNA: Correlated microRNA expression information
-   mRNA: Differential expression of associated genes
-   Splice: Any information on splice variants and alternative splicing
-   Methylation: DNA methylation, heterochromatin and Methyl-Seq information
-   Pathway: Information on known or predicted pathways
-   Gene: Information on known or predicted genes
-   Activity: Molecular activities in the related pathway; kinase, methylation, phosphorylation
-   Regulation: Mutations in known or predicted regulatory regions
-   Exogenous: Relevant microbial genome information
-   Mobile: Information on transposable DNA elements
-   Repeats: Available information on any tandem or interspersed DNA repeat sequences associated with the disease
-   Protein: Information on body fluids protein concentration and activity

Clinical Oncology Metadata Fields

-   Age:
-   Tumor size:
-   Tumor grade: Cellular differentiation
-   Tumor stage:
-   Tumor behavior:
-   Origin: Organ, tissue, cell
-   Node status: Positive or negative
-   Hormone receptor: Positive or negative
-   Laboratory procedures ordered:
    -   DNA preparation
    -   RNA purification
    -   DNase treatment
    -   PCR amplification
    -   cDNA purification
    -   microarray hybridization; scanning
    -   next generation sequencing
        -   raw reads
        -   sorting
        -   align and Map
        -   calling variants
        -   clinical associations
        -   molecular associations

In one embodiment some or all of the information within the genome data repository 2140 is linked on the basis of UUIDs corresponding to ones of the sequence data files 2144. For example, consider the exemplary case in which a sample of a cancerous tumor is taken from the patient. In this case a first UUID could be assigned to the tumor sample and stored within the genome data repository 2140 in association with information relating to the tumor (e.g., size of tumor, date taken, procedures performed). This first UUID could also be linked to medical or other records associated with the patient.

Based upon tissue from the tumor sample, various analytes (e.g., DNA, RNA) may be derived and purified. Each of these purified analytes may then also be associated with a different UUID, all of which are linked to the first or primary UUID associated with the tumor sample itself. Various information relating to each such analyte (e.g., concentration of the analyte within the test tube or other analyte repository, name or other information identifying the individual responsible for preparing the analyte sample) may be stored in connection with the UUID corresponding to the analyte. An aliquot of the solution of one such purified analyte (DNA, RNA, etc.) may then be obtained and assigned a UUID linked to, for example, either or both of the UUID of the analyte solution and the first or primary UUID of the tumor sample. The aliquot may then be provided to a sequencing machine and another UUID assigned to the resultant sequence data. In addition to base-pair sequence data, such sequence data may include information relating to, for example, the machine used to perform the sequencing and the individual(s) responsible for operating such machine at the time of the sequencing. In addition, variant calls may be made with respect to such sequence information and the corresponding sequence variants may be assigned UUIDs linked to the UUID of such sequence information.

In one embodiment the Smart Repository™ 2110 provides the sequence data generated by the sequencing machine to another node of a biological data network in which the Smart Repository™ 2110 is included. Such node performs variant call processing to determine the sequence variants corresponding to one or more portions of the sequence data and provides such sequence variants to the Smart Repository™ 2110. Each of these sequence variants may then be assigned a UUID linked to the underlying sequence data and stored within the genome data repository 2140. Similarly, either the underlying sequence data or the associated sequence variants may be correlated with, for example, drug efficacy information. Such correlated information may be assigned one or more UUIDs linked to the UUIDs of the underlying sequence data and/or the sequence variants and stored within the genome data repository 2140.

As a consequence of this linked relationship among data records within the genome data repository 2140, an actor 2124X may submit a query to the repository 2140 identifying one or more UUIDs (e.g., the UUID relating to a particular tumor sample), and receive information relating to some or all of the data records associated with UUIDs linked to the identified UUID(s). The particular fields of the records within the repository 2140 which are evaluated by the transcriptor 2128 in response to a query in order to identify a relevant set of UUIDs to be returned in response to such a query will generally be dependent upon the subject matter of the query. For example, the transcriptor 2128 would evaluate attribute information from a different set of fields within the metadata files 2142, BioIntelligence headers or other records stored within the repository 2140 in response to a query relating to breast cancer than would be evaluated in response to a query relating to prostate cancer. This difference might range from particular types of sequence variants and modifications associated with certain cancers, to the pertinent clinical information that may be taken from the patient's laboratory reports.
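
By way of illustration only, the UUID linking described above (tumor sample, analyte, aliquot, sequence data, sequence variants) may be sketched as a small in-memory graph that is traversed from a queried UUID. The class name `LinkedRecordStore`, its methods and the record fields are hypothetical and are not part of the disclosed repository; a practical implementation would likely use a persistent store.

```python
import uuid
from collections import defaultdict, deque


class LinkedRecordStore:
    """Toy in-memory stand-in for UUID-linked records (hypothetical)."""

    def __init__(self):
        self.records = {}              # UUID string -> record dict
        self.links = defaultdict(set)  # UUID string -> linked UUID strings

    def add(self, kind, parent=None, **info):
        """Store a record under a fresh UUID, optionally linked to a parent."""
        uid = str(uuid.uuid4())
        self.records[uid] = {"kind": kind, **info}
        if parent is not None:
            # Links are kept in both directions so a query can start anywhere.
            self.links[uid].add(parent)
            self.links[parent].add(uid)
        return uid

    def linked(self, start):
        """Return every record reachable from `start` via UUID links."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in self.links[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return {uid: self.records[uid] for uid in seen}


store = LinkedRecordStore()
tumor = store.add("tumor_sample", size_cm=2.1)
dna = store.add("analyte", parent=tumor, analyte="DNA")
aliquot = store.add("aliquot", parent=dna)
seq = store.add("sequence_data", parent=aliquot, file_format="BAM")
variant = store.add("sequence_variant", parent=seq)
unrelated = store.add("tumor_sample", size_cm=0.8)

# Querying by the tumor sample's UUID returns all records linked to it.
related_kinds = sorted(r["kind"] for r in store.linked(tumor).values())
print(related_kinds)
```

In this sketch a query identifying the UUID of the tumor sample returns the analyte, aliquot, sequence data and variant records, while the unrelated sample is not returned.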

Referring to FIG. 22, there is shown a flowchart representative of an exemplary interaction 2200 between an actor 2124X and the Smart Repository™ 2110. In the exemplary interaction 2200 illustrated by FIG. 22, the Smart Repository™ 2110 receives a request from the actor 2124X to transfer one or more sequence data files or to provide other information (stage 2210). In this embodiment the SmartTracker™ module 2118 is aware of, or may determine based upon contents of the metadata files 2142, the sequence data files 2144 potentially relevant to the query (stage 2220) and assembles the UUIDs corresponding to these files into a list. The sequence data files may be in any format including, for example, the binary alignment map (BAM) format. The SmartTracker™ module 2118 then identifies, based upon this list of UUIDs, attributes of the metadata files 2142 associated with the sequence data files previously determined to be potentially relevant to the query. In a stage 2240 the transcriptor 2128 then determines, in view of the request and the prior system usage (e.g., sequence-related upload/download activity) of the requesting actor 2124X, a most pertinent set of sequence data files 2144 and metadata files 2142. The transcriptor 2128 then works in concert with the SmartTracker™ module 2118 to form, from this most pertinent set of sequence data files 2144 and metadata files 2142, an aggregated script of sequence information and related metadata (stage 2250). The aggregated script is then encapsulated with header information (stage 2260). This encapsulated information script may then be packetized and the resulting data packet(s) sent to the requesting actor 2124X over a network (stage 2270).

Turning now to FIG. 23, there is illustrated an alternate implementation of a Smart Repository™ 2300 including a GeneTransfer Executive module 2310 and a transcriptor 2320. As shown, the transcriptor 2320 integrates a SmartTracker™ module 2324 together with a correlation engine 2328 and a metadata module 2332. In one embodiment the metadata module 2332 is configured for aggregating and managing metadata related to queries received by the Smart Repository™ 2300 in the manner described herein.

The GeneTransfer Executive module 2310 includes a transactor 2340, a GeneTransfer module 2344, an access control manager 2348 and an encryption engine 2352. In one embodiment the GeneTransfer module 2344 generates packetized biological data units in the manner described herein. These packetized biological data units are then encrypted by the encryption engine 2352 prior to being sent by a network interface 2360 to a requesting client actor (not shown).

The Smart Repository™ 2300 further includes a genome data repository 2370, metadata storage 2374 and a repository of prior knowledge 2378. The network interface 2360 also facilitates the transfer of genomic sequence data, metadata and prior knowledge between, on the one hand, the genome data repository 2370, metadata storage 2374 and repository of prior knowledge 2378 and, on the other hand, the GeneTransfer Executive 2310 and the transcriptor 2320.

Attention is now directed to FIG. 24, which depicts an exemplary implementation of an actor 2400 configured to interact as a client with the Smart Repository™ 2300 of FIG. 23. As shown, the actor 2400 includes a processor 2410 and a memory space 2470, which may include one or more physical memory devices. The actor 2400 may also include peripherals such as a display 2420, user input/output devices such as mice and keyboards (not shown), one or more media drives 2430, as well as other devices used in conjunction with computer systems (not shown for purposes of clarity). In addition, one or more databases 2460 may be included to store data such as compressed or uncompressed biological sequences, dictionary information, metadata or other data or information, such as computer files. Database 2460 may be implemented in one or more separate physical memory devices.

The actor 2400 may also include one or more network connections 2440 configured to send or receive biological data, sequences, instruction sets, or other data or information to and from Smart Repositories or other databases or computer systems. The network connection 2440 may allow users to receive uncompressed or compressed biological sequences from, for example, the Smart Repository™ 2300 as well as send uncompressed or compressed sequences. Network connection 2440 may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies as are known or developed in the art.

Memory space 2470 may be configured to store data as well as instructions for execution on the processor 2410 to implement the methods described herein. In particular, memory space 2470 may include a GeneTransfer client module 2472 for transferring genomic sequence information to and from the GeneTransfer module 2344 within the Smart Repository™ 2300. Memory space 2470 may further include an operating system (OS) module 2474, a data module 2476 configured to temporarily store sequence data and/or associated attributes or metadata, and a decryption module 2478 for decrypting encrypted biological data units or other encrypted information received from the Smart Repository™ 2300.

Attention is now directed to the process flow diagram of FIG. 25A and the flowchart of FIG. 25B, which collectively provide a more detailed representation of an exemplary process 2500 performed by the Smart Repository™ 2110 in processing a request from an actor 2124. In one embodiment the Smart Repository™ 2110 generates a two-part response to each query message received from an actor 2124. In particular, the Smart Repository™ 2110 may generate and send to the actor 2124 a specific entitlement-controlled response 2504 comprised of, for example, a list of the UUIDs of sequence data files 2144 and associated metadata files 2142 corresponding to the request. As is discussed in further detail below, the Smart Repository™ 2110 may also generate a supplementary response 2508 comprised of a script of other information determined to be relevant to the request.

The process 2500 is initiated in response to a query sent by the actor 2124 to one or more Smart Repositories, such as the Smart Repository™ 2110 (stage 2501). It should be understood that in certain embodiments the actor 2124 may submit requests to a single Smart Repository™, such as the Smart Repository™ 2110. In such embodiments the Smart Repository™ 2110 may be capable of responding to the request directly. Alternatively, the Smart Repository™ 2110 may parse the request, send corresponding requests to one or more other Smart Repositories included within a biological data network accessible to the Smart Repository™ 2110, and forward a response to the request to the actor 2124 based upon information included within the Smart Repository™ 2110 and/or provided by such other Smart Repositories. In other embodiments the actor 2124 may send the request to a group of Smart Repositories and further process the set of results received from this group.

The query received during stage 2501 may exhibit varying degrees of specificity. For example, the query could request that the Smart Repository™ 2110 return information relating to all of the whole genome sequences included within the Smart Repository™ 2110 (or available within other Smart Repositories included within a biological data network in which the Smart Repository™ 2110 is also included) associated with a diagnosis of prostate cancer. Somewhat more specifically, the query received during stage 2501 could request generation of a complete list of all of the genome sequence files 2144 (e.g., BAM files) and associated ancillary metadata files 2142 (e.g., XML files) that have been submitted to the Smart Repository™ 2110 by a particular actor 2124 (e.g., a genome sequencing center) during a certain time frame. In this case the scripted response 2160 to the query could comprise a list of those UUIDs representative of all such genome sequence files 2144 and associated ancillary metadata files 2142.

In other cases the query received during stage 2501 may identify a particular disease type. In this regard the query could request information relating to the genome sequence files 2144 associated with tissue from individuals that have been diagnosed with a certain disease type and sequenced within a particular time period at a specific sequencing center. As a specific example of this case, the query could request all sequence files generated at the Broad sequencing center based upon patients diagnosed with prostate cancer which were uploaded to a Smart Repository™ within a particular biological data network between Jun. 1, 2011 and Aug. 31, 2011. In this case the expected response would be a list of those UUIDs representative of all the genome sequence files and metadata files (e.g., XML files) relating to patients diagnosed with prostate cancer.

In a stage 2502, the SmartTracker™ 2118 evaluates the received query and identifies all of the sequence data files 2144 and metadata files 2142 within the Smart Repository™ 2110 encompassed by such query. The SmartTracker™ 2118 may also identify all other sequence data files and metadata files accessible within any biological data network(s) in which the Smart Repository™ 2110 is included that are requested by the query. In one embodiment the SmartTracker™ 2118 identifies the appropriate genome sequence data files and associated metadata files by tracking and parsing the attribute fields of such metadata files in accordance with the received query. For example, in cases in which the received query indicates an interest in sequence data files from individuals diagnosed with prostate cancer which have been uploaded to a particular sequencing center during a particular time, the SmartTracker™ module 2118 would evaluate the attribute fields of all available metadata files relating to these parameters and generate a list of UUIDs corresponding to the metadata files and the associated genome sequence data files (stage 2503).
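
The attribute-field parsing of stages 2502 and 2503 may be pictured, purely as a sketch, as a filter over metadata records that yields a list of UUIDs. The record layout, the field names (`disease_type`, `center`, `uploaded`), the sample values and the function `matching_uuids` are assumptions made for illustration and do not reflect the actual format of the metadata files 2142.

```python
from datetime import date

# Hypothetical metadata records; field names and values are illustrative only.
METADATA_FILES = [
    {"uuid": "uuid-001", "disease_type": "prostate cancer",
     "center": "Center A", "uploaded": date(2011, 6, 15)},
    {"uuid": "uuid-002", "disease_type": "prostate cancer",
     "center": "Center B", "uploaded": date(2011, 7, 1)},
    {"uuid": "uuid-003", "disease_type": "breast cancer",
     "center": "Center A", "uploaded": date(2011, 7, 20)},
    {"uuid": "uuid-004", "disease_type": "prostate cancer",
     "center": "Center A", "uploaded": date(2011, 10, 2)},
]


def matching_uuids(files, disease_type, center=None, start=None, end=None):
    """Return UUIDs of metadata files whose attribute fields satisfy the query."""
    selected = []
    for record in files:
        if record["disease_type"] != disease_type:
            continue
        if center is not None and record["center"] != center:
            continue
        if start is not None and record["uploaded"] < start:
            continue
        if end is not None and record["uploaded"] > end:
            continue
        selected.append(record["uuid"])
    return selected


# A specific query: one disease type, one center, one upload window.
uuid_list = matching_uuids(METADATA_FILES, "prostate cancer", center="Center A",
                           start=date(2011, 6, 1), end=date(2011, 8, 31))
# A broader query: the disease type alone.
broad_list = matching_uuids(METADATA_FILES, "prostate cancer")
print(uuid_list, broad_list)
```

The broader call illustrates how the same records yield a larger UUID list when the query carries fewer constraints.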

In one embodiment the list of UUIDs generated by the SmartTracker™ module 2118 is filtered by the entitlement control module 2140 based upon, for example, patient consent, subscription parameters and the like (stage 2503A). The filtered list of UUIDs produced by the entitlement control module 2140, which comprises an initial, specific response to the query received during stage 2501, is then sent to the requesting actor 2124 (stage 2504). In other embodiments the list of UUIDs is not filtered prior to being provided to the requesting actor 2124. However, in this case the entitlement control module 2140 enforces conditional access rules when the requesting actor 2124 attempts to download the genome sequence data files or metadata files corresponding to any of the listed UUIDs. That is, the requesting actor 2124 is permitted to download only those sequence-related or metadata files which the entitlement control module 2140 determines that such actor 2124 is entitled to access.
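
One way the entitlement filtering of stage 2503A might be approximated is shown below. This is a minimal sketch under stated assumptions: the consent flag `research_use`, the notion of a subscribed `collection`, and the function `filter_entitled` are all invented for the example; the disclosure itself names only patient consent and subscription parameters as possible criteria.

```python
def filter_entitled(uuids, consent, subscriptions, actor):
    """Keep only UUIDs the actor may see under consent and subscription rules."""
    allowed_collections = subscriptions.get(actor, set())
    return [
        uid for uid in uuids
        # A UUID with no consent record is treated as not consented.
        if consent.get(uid, {}).get("research_use", False)
        and consent[uid].get("collection") in allowed_collections
    ]


# Hypothetical consent records keyed by UUID.
CONSENT = {
    "uuid-001": {"research_use": True, "collection": "oncology"},
    "uuid-002": {"research_use": False, "collection": "oncology"},
    "uuid-004": {"research_use": True, "collection": "cardiology"},
}
# Hypothetical subscription parameters keyed by actor.
SUBSCRIPTIONS = {"actor-2124": {"oncology"}}

candidate_uuids = ["uuid-001", "uuid-002", "uuid-004", "uuid-999"]
filtered = filter_entitled(candidate_uuids, CONSENT, SUBSCRIPTIONS, "actor-2124")
print(filtered)
```

The same predicate could instead be applied at download time, which corresponds to the alternative embodiment in which the UUID list is returned unfiltered.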

At stage 2504A, the requesting actor 2124 prompts the transcriptor 2128 to initiate a process of generating a script of supplementary information related to the subject matter of the initial, specific response provided to the requesting actor 2124 during stage 2504. In one embodiment the requesting actor 2124 automatically prompts the transcriptor 2128 to initiate such process upon receiving the initial, specific response during stage 2504, provided the requesting actor 2124 has expressed a preference to receive such an information script (either as part of the request received during stage 2501 or otherwise). In other embodiments stage 2504A is initiated only after the requesting actor 2124 has received the initial, specific response during stage 2504 and subsequently explicitly requested the script of supplementary information.

In order to generate the script of supplementary information, the transcriptor 2128 evaluates the query received during stage 2501 and the initial, specific response delivered during stage 2504. Next, in a stage 2505, the transcriptor 2128 initiates a script request by commencing a process for identifying a set of highest ranking attributes (HRAs) inherent within the metadata files returned as part of the initial, specific response during stage 2504. This could involve, for example, determining the relative frequency at which various attributes appear within the metadata files returned during stage 2504. For example, if a particular attribute appears in every one of the metadata files returned during stage 2504, then such attribute would likely be included among the set of HRAs associated with the corresponding query received during stage 2501. Based upon an evaluation of, for example, the relative frequency at which various attributes appear within the metadata files returned during stage 2504, a set of HRAs may be determined by the transcriptor 2128. In certain embodiments other considerations may bear upon whether a particular attribute is included among the set of HRAs corresponding to a given query. For example, the distribution of the particular attribute information as it relates to the universally unique identifier (UUID) and the strength of the curated evidence associated with such attribute information may also be considered by the transcriptor 2128 when determining a set of HRAs.

In addition to parsing the metadata files returned during stage 2504 in order to determine a set of HRAs, during stage 2506 the transcriptor 2128 will typically also evaluate those metadata files which are generally encompassed by the query received during stage 2501 but which were not identified during stage 2503 because of other limitations or constraints present within the query. For example, in the case in which a query received during stage 2501 requests a list of sequence files associated with a diagnosis of prostate cancer which were uploaded to the Smart Repository™ 2110 by a particular sequencing center during a particular time window, the transcriptor 2128 may nonetheless evaluate the attribute information included within those metadata files associated with a diagnosis of prostate cancer which are also associated with sequence information uploaded by other than the specified sequencing center and/or were uploaded outside of the particular time window. Similarly, the transcriptor 2128 may evaluate metadata files associated with a diagnosis of prostate cancer which are stored within other Smart Repositories as part of the process of determining HRAs corresponding to a particular query. In this way the set of HRAs may be determined to include attributes not otherwise known to be associated with a disease type specified within a query (e.g., prostate cancer) but which in fact are highly correlated to known cases of such disease type.

FIG. 26 is a flowchart representative of an exemplary process 2650 for ranking attributes appearing within the relevant metadata files in order to identify a set of HRAs as contemplated during stage 2506. Referring to FIG. 26, in a stage 2654 at least one primary subject (e.g., prostate cancer) of the query received in stage 2501 (FIGS. 25A and 25B) is determined. All metadata files having one or more fields containing an attribute relating to this primary subject are then identified (stage 2660). For example, if the primary subject were determined to be “prostate cancer”, then all metadata files having the attribute of “prostate cancer” in a “disease type” field would be identified. This will typically include identifying those metadata files having one or more attributes within pertinent fields which relate to the primary subject but which were excluded from the initial, specific query response because of the presence of other filter parameters (e.g., time window during which sequence information was uploaded) within the query received during stage 2501. A value score is then generated in a stage 2664 for each field attribute of the metadata files identified during stage 2660 based on, for example, the frequency and distribution of the attribute within the identified metadata files, the strength of the curated field, the relationship to disease diagnosis (e.g., PSA score, other screens and clinical assays) and other rank order rules. The field attributes determined to be of relevance to the primary subject are then ranked based upon these value scores and a set of HRAs is then selected based upon this ranking (stage 2668).
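
The value scoring of stage 2664 and the ranking of stage 2668 are not reduced to a formula in this description. The following sketch assumes, solely for illustration, that an attribute's value score is the fraction of identified metadata files in which it appears multiplied by a curation-strength weight; the function `rank_attributes`, the weights and the sample attributes are hypothetical.

```python
from collections import Counter


def rank_attributes(files, evidence_weight, top_n):
    """Score each attribute and return the top_n as (attribute, score) pairs.

    Assumed score: (fraction of files containing the attribute) multiplied by
    a curation-strength weight, defaulting to 1.0 when no weight is given.
    """
    counts = Counter()
    for record in files:
        # Count each attribute at most once per metadata file.
        for attribute in set(record["attributes"]):
            counts[attribute] += 1
    total = len(files)
    scores = {
        attribute: (count / total) * evidence_weight.get(attribute, 1.0)
        for attribute, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical metadata files identified for the primary subject (stage 2660).
IDENTIFIED_FILES = [
    {"uuid": "uuid-a", "attributes": ["PTEN deletion", "PSA above 4 ng/ml"]},
    {"uuid": "uuid-b", "attributes": ["PTEN deletion", "node status positive"]},
    {"uuid": "uuid-c", "attributes": ["PTEN deletion", "PSA above 4 ng/ml"]},
    {"uuid": "uuid-d", "attributes": ["PSA above 4 ng/ml"]},
]
# Hypothetical curation-strength weights.
EVIDENCE_WEIGHT = {"PTEN deletion": 1.0, "PSA above 4 ng/ml": 0.8,
                   "node status positive": 0.5}

hras = rank_attributes(IDENTIFIED_FILES, EVIDENCE_WEIGHT, top_n=2)
hra_names = [name for name, _ in hras]
print(hra_names)
```

Other rank order rules, such as the distribution of an attribute across UUIDs, could be folded into the score in the same manner.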

The HRAs may or may not be within a field identified by the query received during stage 2501. In one embodiment ranking of the attributes during stage 2668 determines which information fields and sequence data files, including those files identified in the initial, specific response to the received query, may be most relevant to the primary subject of the query received during stage 2501. For example, in the case in which the received query specifically requests a list of sequence files relating to prostate cancer, all of the sequence files identified in the initial, specific response will be associated with metadata files containing the attribute “prostate cancer” within the “disease type” field. However, there could be instances in which a specific attribute of a metadata file such as, for example, blood PSA level, could be indicated to be significantly above a concentration level associated with a high risk for prostate cancer (e.g., 4 ng/ml), but in which such metadata file does not include an attribute indicating a diagnosis of prostate cancer and the associated individual exhibits no other symptoms of prostate cancer. Accordingly, even though a query received during stage 2501 having a primary subject of “prostate cancer” may not have included “blood PSA level” as a parameter, a blood PSA level in excess of 4 ng/ml could be deemed to be an HRA with respect to such query because of its correlation with prostate cancer. As another example, there could be cases associated with individuals who have developed a prostate cancer tumor in which one attribute of a metadata file relating to fluorescence in situ hybridization (FISH) reflects a “PTEN deletion” while another such attribute indicates a normal or low PSA blood level. Such a PTEN deletion could then be determined by the transcriptor 2128 to be an HRA. As a consequence, it would be expected that most of the genome sequence data files on the list of UUIDs returned in a response to a query for which a PTEN deletion is determined to be an HRA will have metadata attributes indicative of such a deletion.

The HRAs may be considered to be data points that are heavily weighted based on a voting scheme and a set of rules that may be continuously modified based on new information and knowledge. In this regard, an HRA may be considered to be specific to a particular query and the response to such query.

Again referring to FIGS. 25A and 25B, in a stage 2507 the transcriptor 2128 next identifies those metadata files 2142, and the metadata files stored within other networked repositories, which include at least one highly-ranked attribute and were not identified in the initial, specific response to the request received during stage 2501. For example, in the exemplary case in which the request relates to prostate cancer, such metadata files including at least one highly-ranked attribute could include those having a PTEN deletion attribute. In this particular exemplary case, all metadata files that are associated with genome sequence files and contain within the disease field the term “prostate cancer” will be involved in the analysis process.

In the exemplary case in which the primary subject of the query received during stage 2501 was determined to be prostate cancer, at least all of the metadata files associated with individuals who have been clinically diagnosed with prostate cancer by an oncologist will be evaluated by the transcriptor 2128. In one embodiment each such file would include the attribute “prostate cancer” within a “disease type” field of the file. However, simply because an individual is not clinically diagnosed as having prostate cancer using traditional approaches does not mean that such individual is not advancing towards this disease condition at the molecular level. As a consequence, sequence information derived from analytes produced from the tissue of such an individual may include cellular cues or biomarkers which contribute significantly to such disease condition and which therefore may be relevant to determining whether such condition exists in a given individual.

In a stage 2508, the transcriptor 2128 identifies those metadata files which are associated with a primary subject of the query received during stage 2501 (e.g., those metadata files having the attribute “prostate cancer” in the “disease type” field) but which do not include any of the HRAs determined based upon the received query request and the initial, specific response to such request. Again, in one embodiment the metadata files include the metadata files 2142 as well as the metadata files stored within other repositories networked with the Smart Repository™ 2110. In this exemplary case it is assumed that when individuals have been diagnosed with a particular disease (e.g., prostate cancer), an attribute reflecting this diagnosis is included within a “disease type” or similar field in the applicable metadata file. Accordingly, those patients associated with metadata files identified during stage 2508 may be considered to be “rare variants” in the sense that such patients have been diagnosed with a particular disease condition but exhibit none of the HRAs generally associated with such condition.

In a stage 2509, the transcriptor 2128 aggregates the files identified during stages 2506, 2507 and 2508 (or the UUIDs corresponding to such files) in order to form a script of supplementary information relating to the query received during stage 2501 and the initial, specific response to the query provided during stage 2504. Once the script of supplementary information has been formed by the transcriptor 2128 it is sent to the requesting actor 2124 (stage 2510).
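
The identification steps of stages 2507 and 2508 and the aggregation of stage 2509 may be sketched as follows. The sketch is a simplification under stated assumptions: it omits the files evaluated during stage 2506, represents the script as a plain dictionary of UUID lists, and uses a hypothetical function `build_supplementary_script` and hypothetical record fields.

```python
def build_supplementary_script(files, initial_uuids, hras, primary_subject):
    """Collect UUIDs for a supplementary script (simplified stages 2507-2509)."""
    hra_set = set(hras)
    initial = set(initial_uuids)
    hra_matches, rare_variants = [], []
    for record in files:
        shared = set(record["attributes"]) & hra_set
        if shared and record["uuid"] not in initial:
            # Stage 2507: has a highly-ranked attribute, absent from the
            # initial, specific response.
            hra_matches.append(record["uuid"])
        elif record["disease_type"] == primary_subject and not shared:
            # Stage 2508: diagnosed with the primary subject, exhibits no HRA.
            rare_variants.append(record["uuid"])
    # Stage 2509: aggregate both groups into one script.
    return {"hra_matches": hra_matches, "rare_variants": rare_variants}


# Hypothetical metadata records.
ALL_FILES = [
    {"uuid": "uuid-a", "disease_type": "prostate cancer",
     "attributes": ["PTEN deletion"]},
    {"uuid": "uuid-b", "disease_type": "prostate cancer",
     "attributes": ["PTEN deletion"]},
    {"uuid": "uuid-c", "disease_type": "",
     "attributes": ["PSA above 4 ng/ml"]},
    {"uuid": "uuid-d", "disease_type": "prostate cancer", "attributes": []},
    {"uuid": "uuid-e", "disease_type": "breast cancer", "attributes": []},
]

script = build_supplementary_script(
    ALL_FILES,
    initial_uuids=["uuid-a"],
    hras=["PTEN deletion", "PSA above 4 ng/ml"],
    primary_subject="prostate cancer",
)
print(script)
```

Here the undiagnosed record with an elevated PSA attribute is surfaced alongside the diagnosed record omitted from the initial response, and the diagnosed record lacking any HRA is reported as a rare variant.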

The above-described script of supplementary information advantageously enables the requesting actor to access potentially obscure but nonetheless relevant information that was neither explicitly requested nor included within the initial, specific response to the request received during stage 2501. Such advantageous results are facilitated by the dynamic characterization of the original query and subsequent statistical correlation analysis of the various fields of the ancillary metadata associated with the genome sequence data from the patient of interest carried out in the manner described above. However, other approaches to developing such a script of supplemental information based upon an original query and/or initial query response may be apparent to those skilled in the art in view of the teachings and exemplary approaches described herein. Moreover, it should be appreciated that the schema and process disclosed in FIGS. 25-26 are purely illustrative and do not in any way restrict the scope of the present disclosure. For example, the process could be reconfigured or redesigned in light of the teachings herein to leverage the features and functionality of a Smart Repository™ accessible to network users.

Request Processing in the Biological Data Network

Turning now to FIG. 27, a flowchart 2700 provides an overview of an exemplary manner in which network nodes 510 of the biological data network 500 may cooperate to process a client request. In stage 2710, a request is received from a client device at a first network node 510. Processing is then performed at the first network node based upon the request (stage 2712). In stage 2714, it is determined whether processing at the first network node is complete. If such processing is complete, then an appropriate response is returned to the client (stage 2718). If not, the results of the processing at the first network node 510 may be routed or switched to a next network node 510 selected or otherwise scheduled in accordance with the nature of such processing results (stage 2720). Processing is then performed at the next network node based upon the request (stage 2722). It is then determined whether processing at the next network node has been completed (stage 2724). If such processing has been completed, a response is returned to the client (stage 2718); otherwise, some or all of the accumulated processing results may again be routed or switched to a next network node 510 (stage 2720).
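
The control flow of FIG. 27 amounts to a loop in which each node processes the request and either completes it or hands the accumulated results to a next node. A minimal sketch follows; the node functions, the `choose_next` selection rule and the `max_hops` safeguard are assumptions introduced for the example and are not drawn from the disclosure.

```python
def run_request(request, nodes, first_node, choose_next, max_hops=8):
    """Process a request across nodes until one reports completion."""
    results = {}
    node_name = first_node
    path = []
    for _ in range(max_hops):
        path.append(node_name)
        # Each node returns updated results and a completion flag.
        results, complete = nodes[node_name](request, results)
        if complete:
            return results, path         # response returned to the client
        node_name = choose_next(results)  # route according to the results
    raise RuntimeError("request not completed within max_hops")


def variant_node(request, results):
    """Hypothetical node that identifies variants but cannot finish alone."""
    return dict(results, variants=["variant-1"]), False


def annotation_node(request, results):
    """Hypothetical node that annotates variants and completes the request."""
    annotations = {v: "annotated" for v in results["variants"]}
    return dict(results, annotations=annotations), True


NODES = {"variant": variant_node, "annotation": annotation_node}


def choose_next(results):
    """Pick the next node from the nature of the accumulated results."""
    return "annotation" if "variants" in results else "variant"


response, path = run_request({"query": "example"}, NODES, "variant", choose_next)
print(path, response)
```

The hop limit is a design choice of the sketch to guarantee termination; the flowchart itself places no stated bound on the number of nodes traversed.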

FIG. 28 is a flowchart representative of an exemplary sequence of operations involved in the identification and processing of sequence variants at a network node 510. In stage 4010, a genome sequence (e.g., a segment of the entire genome of an organism) associated with a request issued by a user terminal or other client device is received at a network node 510. The genome sequence is then compared with a reference sequence at the network node (stage 2812). Through this comparison, sequence variants between the genome sequence and the reference sequence are identified (stage 2816). In a stage 2820, a network location of a database containing information concerning at least a first of the sequence variants is determined. Next, at least the first of the sequence variants is sent from the network node to the database (stage 2822). Information from the database relating to the first of the sequence variants is then received at the network node (stage 2826). A response is then sent from the network node to the user terminal based upon the information from the database (stage 2830).
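By way of illustration only, the operations of FIG. 28 might be sketched in Python as follows. The sketch assumes that the genome sequence and the reference sequence are already aligned and of equal length, and that variants are limited to single-base substitutions. The `DATABASES` table, the location string `db.example/variants`, and the function names are hypothetical stand-ins for a remote database and its network location.

```python
from typing import Dict, List, Optional, Tuple

# (0-based position, reference base, observed base)
Variant = Tuple[int, str, str]


def find_variants(sequence: str, reference: str) -> List[Variant]:
    """Compare a sequence with a reference and list single-base substitutions."""
    if len(sequence) != len(reference):
        raise ValueError("sketch assumes pre-aligned sequences of equal length")
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sequence))
            if r != s]


# Hypothetical in-memory stand-in for remote databases keyed by network location.
DATABASES: Dict[str, Dict[Variant, str]] = {
    "db.example/variants": {
        (3, "G", "A"): "hypothetical annotation for G>A at position 3",
    },
}


def locate_database(variant: Variant) -> str:
    """Determine the network location of a database for a variant (fixed here)."""
    return "db.example/variants"


def handle_sequence(sequence: str, reference: str) -> dict:
    """Identify variants, look each one up, and assemble a response."""
    variants = find_variants(sequence, reference)
    annotations: Dict[Variant, Optional[str]] = {}
    for variant in variants:
        location = locate_database(variant)
        # The dictionary lookup stands in for sending the variant to the
        # database and receiving the related information in return.
        annotations[variant] = DATABASES[location].get(variant)
    return {"variants": variants, "annotations": annotations}


response = handle_sequence("ACTAACGT", "ACTGACGT")
```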

Turning now to FIG. 29, a flowchart 2900 is provided of an exemplary sequence of operations carried out by network nodes 510 of the biological data network in connection with processing of a disease-related query. In a stage 2910, a query relating to a specified disease and a genomic sequence associated with the query are received at a first network node 510. Any variant alleles within the genomic sequence are then identified relative to a control sequence (stage 2912). Next, information relating to the variant alleles is sent from the first network node to a second network node (stage 2916). In a stage 2920, a statistical correlation analysis is performed at the second network node 510 in order to identify a set of the variant alleles included within genes associated with the specified disease. Information relating to the set of variant alleles is then received at the first network node (stage 2926). In a stage 2930, a response to the query is sent from the first network node 510 based upon the information relating to the set of variant alleles.

Attention is now directed to FIG. 30, which is a flowchart 3000 representative of an exemplary sequence of operations involved in providing pharmacological response data in response to a user query concerning a specified disease. In a stage 3010, a query relating to a specified disease and a genomic sequence associated with the query are received at a first network node 510. Next, any variant alleles within the genomic sequence are identified relative to a control sequence. In a stage 3016, information relating to the variant alleles is sent from the first network node 510 to a second network node. A statistical correlation analysis is then performed at the second network node in order to identify those of the variant alleles included within genes associated with the specified disease (stage 3020). At a third network node 510, processing is performed to associate pharmacological response data with those of the variant alleles included within genes associated with the specified disease (stage 3022). Such pharmacological response data is sent from the third network node 510 and received at the first network node (stage 3026). A response to the query is then sent from the first network node to, for example, a client terminal based upon the pharmacological response data (stage 3030).
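The division of labor among the three network nodes of FIG. 30 may be sketched, purely for purposes of illustration, as three Python functions. A simple set-membership test is used here in place of the statistical correlation analysis, which is a deliberate simplification. The gene names, disease name and response notes in the `DISEASE_GENES` and `PHARMACOLOGY` tables are placeholders invented for this example and are not real biological data.

```python
from typing import Dict, List, NamedTuple, Set


class VariantAllele(NamedTuple):
    gene: str
    position: int
    allele: str


# Placeholder tables invented for this sketch; not real biological data.
DISEASE_GENES: Dict[str, Set[str]] = {"disease_x": {"GENE_A", "GENE_C"}}
PHARMACOLOGY: Dict[str, str] = {
    "GENE_A": "placeholder response note for GENE_A variants",
}


def second_node_filter(variants: List[VariantAllele],
                       disease: str) -> List[VariantAllele]:
    """Keep variant alleles lying within genes associated with the disease.

    Set membership stands in for the statistical correlation analysis.
    """
    genes = DISEASE_GENES.get(disease, set())
    return [v for v in variants if v.gene in genes]


def third_node_annotate(variants: List[VariantAllele]) -> Dict[VariantAllele, str]:
    """Associate pharmacological response data with each retained variant."""
    return {v: PHARMACOLOGY.get(v.gene, "no data") for v in variants}


def first_node_query(disease: str, variants: List[VariantAllele]) -> dict:
    """Coordinate the second and third nodes and build the query response."""
    relevant = second_node_filter(variants, disease)
    pharmacological = third_node_annotate(relevant)
    return {"disease": disease, "pharmacological_response": pharmacological}


variants = [VariantAllele("GENE_A", 101, "T"), VariantAllele("GENE_B", 202, "C")]
response = first_node_query("disease_x", variants)
```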

Referring now to FIG. 31, there is shown a flowchart 3100 representative of the manner in which information relating to various different layers of biologically-relevant data organized consistently with the biological data model 200 may be processed at different network nodes 510. In a stage 3110, a request to process data comprised of at least a DNA layer 210 and an RNA layer 220 is received at a first network node. Data in the DNA layer is then processed in accordance with the request (stage 3112). At least partial results of the processing of the data in the DNA layer are then forwarded to a second network node (stage 3116). Data within the partial results is then processed at the second network node with respect to at least the RNA layer (stage 3120). A third network node is then identified based upon the results of the processing at the second network node (stage 3122). The results of the processing at the second network node are then forwarded to the third network node, which then processes such results (stage 3126). The results of the processing performed at the third network node are then sent and subsequently received at the first network node (stage 3130). A response to the request is then sent from the first network node to, for example, a client terminal based upon the results of the processing performed at the third network node 510 (stage 3132).
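One way to picture the layered processing of FIG. 31 is the following Python sketch, offered under stated assumptions rather than as the disclosed implementation. Here the DNA-layer processing is assumed to be extraction of a requested region, the RNA-layer processing is assumed to be derivation of a transcript from the coding strand by replacing T with U, and the rule for identifying the third node (presence of an AUG start codon) is hypothetical. All function and node names are invented for this example.

```python
from typing import Callable, Dict


def process_dna_layer(request: dict) -> dict:
    """First node: extract the requested DNA region as a partial result."""
    start, end = request["region"]
    return {"dna_segment": request["dna"][start:end]}


def process_rna_layer(partial: dict) -> dict:
    """Second node: derive a transcript from the coding strand (T becomes U)."""
    return {**partial, "rna": partial["dna_segment"].replace("T", "U")}


def select_third_node(results: dict) -> str:
    """Identify the third node from the second node's results (hypothetical rule)."""
    return "translate" if "AUG" in results["rna"] else "summarize"


# Hypothetical third nodes, each represented by a function.
THIRD_NODES: Dict[str, Callable[[dict], dict]] = {
    "translate": lambda r: {**r, "note": "start codon present"},
    "summarize": lambda r: {**r, "note": "no start codon found"},
}


def handle_request(request: dict) -> dict:
    partial = process_dna_layer(request)   # DNA layer at the first node
    results = process_rna_layer(partial)   # RNA layer at the second node
    third = select_third_node(results)     # third node chosen from the results
    final = THIRD_NODES[third](results)    # processing at the third node
    return {"third_node": third, **final}  # returned to the first node


response = handle_request({"dna": "CCATGGTTAA", "region": (2, 8)})
```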

The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.

In one or more exemplary embodiments, the functions, methods and processes described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media. Storage media may be any available media that can be accessed by a computer.

By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

It is understood that the specific order or hierarchy of steps or stages in the processes and methods disclosed is an example of an exemplary approach. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.

Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.

To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.

Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Additionally, the scope of the invention includes hardware not traditionally used, or thought of as having use, within general purpose computing, such as graphics processing units (GPUs).

The steps or stages of a method, process or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Certain of the disclosed methods may also be implemented using a computer-readable medium containing program instructions which, when executed by one or more processors, cause such processors to carry out operations corresponding to the disclosed methods.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is intended that the following claims and their equivalents define the scope of the disclosure.

1. A method of conveying biological sequence data, comprising: generating a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence; storing the data packet in a queue in communication with a network interface; and transmitting the data packet over a network accessible through the network interface.
2. The method of claim 1 wherein the biological sequence data comprises polymeric data.
3. The method of claim 1 wherein the biological sequence data comprises DNA sequence data.
4. The method of claim 3 wherein the header information comprises information relating to mutations within the DNA sequence data.
5. The method of claim 3 wherein the payload further includes embedded data relating to the DNA sequence data.
6. The method of claim 5 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
7. The method of claim 6 wherein the correlative information includes pharmacological information.
8. The method of claim 6 wherein the correlative information includes clinical result information.
9. The method of claim 5 wherein the embedded data is represented within the payload in a compressed form.
10. A method of receiving biological sequence data, the method comprising: receiving, through a network interface of a network node, a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data; providing the data packet to an input packet processor in communication with the network interface; extracting at least the compressed version of the biological sequence data from the data packet; and storing the compressed version of the biological sequence data within a memory of the network node.
11. The method of claim 10 wherein the biological sequence data comprises polymeric data.
12. The method of claim 10 wherein the biological sequence data comprises DNA sequence data.
13. The method of claim 12 wherein the header information comprises [information relating to mutations within the DNA sequence data].
14. The method of claim 12 wherein the payload further includes embedded data relating to the DNA sequence data.
15. The method of claim 14 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
16. The method of claim 15 wherein the correlative information includes pharmacological information.
17. The method of claim 15 wherein the correlative information includes clinical result information.
18. The method of claim 14 wherein the embedded data is represented within the payload in a compressed form.
19. A network node, comprising: a network interface; a packet generator in communication with the network interface, the packet generator being configured to generate a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence; a queue in communication with the network interface, the data packet being stored within the queue; and a transmit controller for controlling transmission of the data packet over a network accessible through the network interface.
20. The network node of claim 19 wherein the biological sequence data comprises polymeric data.
21. The network node of claim 19 wherein the biological sequence data comprises DNA sequence data.
22. The network node of claim 21 wherein the header information comprises [information relating to mutations within the DNA sequence data].
23. The network node of claim 21 wherein the payload further includes embedded data relating to the DNA sequence data.
24. The network node of claim 23 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
25. The network node of claim 24 wherein the correlative information includes pharmacological information.
26. The network node of claim 24 wherein the correlative information includes clinical result information.
27. The network node of claim 23 wherein the embedded data is represented within the payload in a compressed form.
28. A network node, comprising: a network interface; an input packet processor in communication with the network interface, the input packet processor being configured to receive a data packet and extract at least a compressed version of biological sequence data from the data packet wherein the data packet includes a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing the compressed version of the biological sequence data; and a memory in which is stored the compressed version of the biological sequence data.
29. The network node of claim 28 wherein the biological sequence data comprises polymeric data.
30. The network node of claim 28 wherein the biological sequence data comprises DNA sequence data.
31. The network node of claim 30 wherein the header information comprises [information relating to mutations within the DNA sequence data].
32. The network node of claim 30 wherein the payload further includes embedded data relating to the DNA sequence data.
33. The network node of claim 32 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
34. The network node of claim 33 wherein the correlative information includes pharmacological information.
35. The network node of claim 33 wherein the correlative information includes clinical result information.
36. The network node of claim 32 wherein the embedded data is represented within the payload in a compressed form.
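
By way of non-limiting illustration, the packet structure recited in the independent claims (a first header with network routing information, a second header pertaining to the biological sequence data, and a payload representing the sequence relative to a reference) might be sketched in Python as follows. The byte layout (three 32-bit length fields followed by JSON-encoded sections), the field names `dest`, `reference_id` and `variant_count`, the restriction to equal-length sequences with single-base substitutions, and the use of zlib compression are all assumptions made for this sketch and are not specified by the disclosure.

```python
import json
import struct
import zlib
from collections import deque
from typing import List, Tuple


def encode_relative(sequence: str, reference: str) -> List[Tuple[int, str]]:
    """Represent a sequence as substitutions relative to an equal-length reference."""
    if len(sequence) != len(reference):
        raise ValueError("sketch assumes sequences of equal length")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sequence)) if r != s]


def build_packet(dest: str, reference_id: str, sequence: str, reference: str) -> bytes:
    diffs = encode_relative(sequence, reference)
    routing = json.dumps({"dest": dest}).encode()            # first header
    bio = json.dumps({"reference_id": reference_id,
                      "variant_count": len(diffs)}).encode()  # second header
    payload = zlib.compress(json.dumps(diffs).encode())       # compressed payload
    # Hypothetical layout: three big-endian 32-bit section lengths, then sections.
    lengths = struct.pack("!III", len(routing), len(bio), len(payload))
    return lengths + routing + bio + payload


def parse_packet(packet: bytes):
    n1, n2, n3 = struct.unpack("!III", packet[:12])
    routing = json.loads(packet[12:12 + n1])
    bio = json.loads(packet[12 + n1:12 + n1 + n2])
    raw = zlib.decompress(packet[12 + n1 + n2:12 + n1 + n2 + n3])
    diffs = [tuple(d) for d in json.loads(raw)]
    return routing, bio, diffs


# A deque stands in for the queue in communication with the network interface.
transmit_queue = deque()
transmit_queue.append(build_packet("node-2", "ref-1", "ACTAACGT", "ACTGACGT"))
routing, bio, diffs = parse_packet(transmit_queue.popleft())
```

Encoding only the differences from the reference keeps the payload small when the sequence closely matches the reference, which is the motivation for a reference-relative representation in this sketch.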